[57] ABSTRACT

Techniques for generating sophisticated representations of the contents of both queries and documents in a retrieval system by using natural language processing (NLP) techniques to represent, index, and retrieve texts at the multiple levels (e.g., the morphological, lexical, syntactic, semantic, discourse, and pragmatic levels) at which humans construe meaning in writing. The user enters a query and the system processes the query to generate an alternative representation, which includes conceptual-level abstraction and representations based on complex nominals (CNs), proper nouns (PNs), single terms, text structure, and logical make-up of the query, including mandatory terms. After processing the query, the system displays query information to the user, indicating the system's interpretation and representation of the content of the query. The user is then given an opportunity to provide input, in response to which the system modifies the alternative representation of the query. Once the user has provided desired input, the possibly modified representation of the query is matched to the relevant document database, and measures of relevance generated for the documents. A set of documents is presented to the user, who is given an opportunity to select some or all of the documents, typically on the basis of such documents being of particular relevance. The user then initiates the generation of a query representation based on the alternative representations of the selected document(s).
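
The final step described in the abstract, generating a new query representation from the documents the user selected, can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each alternative representation is a dictionary mapping terms to weights, that the selected documents are averaged, and that the representation of a previous query may optionally be added in. The function name `query_from_marked`, the `query_weight` parameter, and the sample terms are hypothetical and are not taken from the specification.

```python
from collections import Counter


def query_from_marked(marked_docs, previous_query=None, query_weight=1.0):
    """Build a query representation from user-selected documents.

    marked_docs: list of dicts mapping term -> weight (assumed format).
    previous_query: optional dict in the same format, added in with
    the hypothetical scaling factor query_weight.
    """
    if not marked_docs:
        raise ValueError("at least one marked document is required")

    combined = Counter()
    # Average the weight of every term over the marked documents.
    for doc in marked_docs:
        for term, weight in doc.items():
            combined[term] += weight / len(marked_docs)

    # Optionally combine with the representation of the earlier query.
    if previous_query:
        for term, weight in previous_query.items():
            combined[term] += query_weight * weight

    return dict(combined)


marked = [
    {"merger": 2.0, "bank": 1.0},
    {"merger": 1.0, "regulator": 1.0},
]
new_query = query_from_marked(marked, previous_query={"bank": 1.0})
```

Averaging is only one plausible way to form the composite; a real system could weight documents by their relevance scores instead.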
USER INTERFACE AND OTHER ENHANCEMENTS FOR NATURAL LANGUAGE INFORMATION RETRIEVAL SYSTEM AND METHOD

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority from, and is a continuation-in-part of the following U.S. Provisional Patent Applications, all filed Aug. 16, 1995, the disclosures of which are hereby incorporated by reference:

1. No. 60/002,451, of Elizabeth D. Liddy, entitled THE PROMISE OF NATURAL LANGUAGE PROCESSING FOR COMPETITIVE INTELLIGENCE;

2. No. 60/002,452, of Elizabeth D. Liddy and Sung H. Myaeng, entitled DR-LINK SYSTEM: PHASE I SUMMARY;

3. No. 60/002,453, of Elizabeth D. Liddy, Edmund S. Yu, Mary McKenna, and Ming Li, entitled DETECTION, GENERATION AND EXPANSION OF COMPLEX NOMINALS;

4. No. 60/002,470, of Elizabeth D. Liddy, Woojin Paik, and Mary McKenna, entitled DEVELOPMENT OF A DISCOURSE MODEL FOR NEWSPAPERS;

5. No. 60/002,471, of Elizabeth D. Liddy, Woojin Paik, Edmund S. Yu, E. S. and Mary McKenna, entitled DOCUMENT RETRIEVAL USING LINGUISTIC KNOWLEDGE; and

6. No. 60/002,472, of Woojin Paik, Elizabeth D. Liddy, Edmund Yu, and Mary McKenna, entitled CATEGORIZING AND STANDARDIZING PROPER NOUNS FOR EFFICIENT INFORMATION RETRIEVAL.

The following applications, including this one, are being filed concurrently, and the disclosure of each other application is incorporated by reference into this application:

patent application Ser. No. 08/696,701, entitled "MULTILINGUAL DOCUMENT RETRIEVAL SYSTEM AND METHOD USING SEMANTIC VECTOR MATCHING," to Elizabeth D. Liddy, Woojin Paik, Edmund S. Yu, and Ming Li (Attorney Docket No. 17704-2.00);

patent application Ser. No. 08/698,472, entitled "NATURAL LANGUAGE INFORMATION RETRIEVAL SYSTEM AND METHOD," to Elizabeth D. Liddy, Woojin Paik, Mary McKenna, and Ming Li (Attorney Docket No. 17704-3.00); and

patent application Ser. No. 08/696,702, entitled "USER INTERFACE AND OTHER ENHANCEMENTS FOR NATURAL LANGUAGE INFORMATION RETRIEVAL SYSTEM AND METHOD," to Elizabeth D. Liddy, Woojin Paik, Mary McKenna, Michael Weiner, Edmund S. Yu, Ted Diamond, Bhaskaran Balakrishan, and David Snyder (Attorney Docket No. 17704-4.00).

BACKGROUND OF THE INVENTION

The present invention relates generally to the field of computer-based information retrieval, and more specifically to the application of natural language processing (NLP) techniques to the interpretation and representation of computer text files, and to the matching of natural language queries to documents with the aid of user interactions.

Computer-based information retrieval is now an established industry serving many professional communities. Retrieval technologies used in this industry share many common features. For example, a user of these systems is typically required to either (1) state an information need, or query, in a circumscribed manner, usually by demarcing the logical requirements of the query as a sequence of terms linked by various operators, or (2) write the query as free-form text, which is then parsed automatically into a sequence of words or phrases, without regard for the logical form of the query or the underlying meaning of the query. In either event the query is represented only by the collection of words that are overtly stated in the query text (or limited stemmed forms of some words, such as plurals). The matching of documents to a query is based on the co-occurrence of these words or phrases.

A second commonality among retrieval systems is that a query representation derived from a user's query statement is automatically formed by the computer system, with limited or no interaction with the user. In most retrieval systems, once an initial query statement has been made in full, the computer system interprets the contents of the query without allowing the user to verify, clarify or expand upon query representations created by the computerized retrieval system. In the same fashion, the subsequent display of retrieved documents is largely under computer control, with little user interaction.

In view of these common characteristics of computer-based retrieval systems, their inability to capture both the preciseness and richness of meaning in queries and documents, and their inability to interact with the user to help formulate a query statement and present retrieved documents, retrieval is often an inexact process.

SUMMARY OF THE INVENTION

The present invention provides techniques for generating sophisticated representations of the contents of both queries and documents in a retrieval system by using natural language processing (NLP) techniques to represent, index, and retrieve texts at the multiple levels (e.g., the morphological, lexical, syntactic, semantic, discourse, and pragmatic levels) at which humans construe meaning in writing. The invention also offers the user the ability to interact with the system to confirm and refine the system's interpretation of the query content, both at the initial query processing step and after query matching has occurred.

According to one aspect of the invention, the user enters a query, possibly a natural language query, and the system processes the query to generate an alternative representation. The alternative representation may include conceptual-level abstraction and enrichment of the query, and may include other representations. In a specific embodiment, the conceptual-level representation is a subject field code vector, while the other representations include one or more of representations based on complex nominals (CNs), proper nouns (PNs), single terms, text structure, and logical make-up of the query, including mandatory terms. After processing the query, the system displays query information to the user, indicating the system's interpretation and representation of the content of the query. The user is then given an opportunity to provide input, in response to which the system modifies the alternative representation of the query. Once the user has provided desired input, the possibly modified representation of the query is matched to the relevant document database, and measures of relevance generated for the documents. The documents in the database have preferably been processed to provide corresponding alternative representations for matching to queries.
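
As a rough illustration of this aspect, the following Python sketch models an alternative representation, applies user input to it, and ranks documents against the revised query. The class and function names, the equal weighting of conceptual and term-based evidence, and the use of cosine similarity are assumptions made for the example; the specification does not prescribe them at this point.

```python
import math
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class AlternativeRepresentation:
    """Hypothetical container for the representations named in the text."""
    sfc_vector: List[float]                            # conceptual level (subject field codes)
    terms: Set[str] = field(default_factory=set)       # CNs, PNs, and single terms
    mandatory: Set[str] = field(default_factory=set)   # terms flagged as required


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def apply_user_input(query, add=(), drop=(), mandatory=()):
    """Return a new representation reflecting the user's corrections."""
    new_mandatory = (query.mandatory | set(mandatory)) - set(drop)
    new_terms = ((query.terms | set(add)) - set(drop)) | new_mandatory
    return AlternativeRepresentation(list(query.sfc_vector), new_terms, new_mandatory)


def relevance(query, doc, concept_weight=0.5):
    """Blend conceptual similarity with term overlap (weights are assumed).

    Mandatory terms are carried in the representation but not scored here.
    """
    concept = cosine(query.sfc_vector, doc.sfc_vector)
    overlap = len(query.terms & doc.terms) / len(query.terms) if query.terms else 0.0
    return concept_weight * concept + (1.0 - concept_weight) * overlap


def rank(query, docs):
    """Score every document and sort from most to least relevant."""
    scored = [(doc_id, relevance(query, doc)) for doc_id, doc in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


initial = AlternativeRepresentation([1.0, 0.0], {"merger", "bank"})
revised = apply_user_input(initial, add={"acquisition"}, drop={"bank"})
documents = {
    "d1": AlternativeRepresentation([1.0, 0.0], {"merger", "acquisition"}),
    "d2": AlternativeRepresentation([0.0, 1.0], {"merger"}),
}
ranked = rank(revised, documents)
```

In this sketch the user's edits change only the term-based part of the representation; a fuller model could also let the user adjust the subject field codes.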
According to a further aspect of the invention, a set of documents is presented to the user, who is given an opportunity to select some or all of the documents, typically on the basis of such documents being of particular relevance. The user then initiates the generation of a query representation based on the alternative representations of the selected document(s). To the extent that the set of documents were retrieved in response to a previous query, the alternative representations of the selected documents may be combined with the alternative representation of the previous query. Thus the user is able to improve on an initial query representation by re-expressing the query as a composite of the representations derived from documents deemed highly relevant by the user, possibly combined with the representation of the original query.

According to a further aspect of the invention, a natural language query is processed to generate a logical representation of terms in the query. The system recognizes words that indicate negation, and divides the terms in the query as to whether such terms belong to the positive or negative portion of the query. In recognition of the fact that a document dealing with the negative portion of the query may contain information relevant to the positive portion, the system is designed to incorporate the terms in the negative portion of the query into the alternative representation of the query. However, in further recognition that the user explicitly specified certain types of subject matter as not being of interest, documents satisfying both the positive and negative portions of the query are segregated from documents meeting only the positive portion of the query.

According to a further aspect of the invention, a natural language query is processed to generate a logical representation of terms in the query. The system recognizes words that indicate a mandatory requirement of the query, and the presence of mandatory terms in a document can be given greater weight in matching. Whether or not the presence of mandatory terms enters into the score, documents containing the mandatory terms are identified and preferably segregated from documents that do not contain all the mandatory terms.

According to a further aspect of the invention, texts (documents and queries) are processed to determine discourse aspects of the text beyond the subject matter of the text. This text structure includes temporal information (past, present, and future), and intention information (e.g., analysis, prediction, cause/effect). Thus the invention is able to detect the higher order abstractions that exist in human communications that are above the word level, such as the difference between a statement describing an expected outcome, the consequence of a particular event (prediction), or a statement that described a past event. Since the system is able to recognize the manifold intentions in a query, it can operate on a greater scope of documents and data without the user having to pre-specify where one suspects the answers (documents) might be.

The sophistication of the text representation used in the invention means that certain discourse that exophorically references tables, graphs, photographs or other images can also be used to search with great efficiency for such images. For example, the captions used to label photographs have a certain discourse structure, and this structure can be used to help effectively search for graphic items.

According to a further aspect of the invention, the system automatically sorts, ranks and displays documents judged relevant to the content of the query, using a multi-tier system of folders containing ranked lists of documents. The inclusion of a document and its position within a folder are typically determined by a relevance score based on the various elements of the alternative representation. However, the user can modify both the viewing order and the sorting order based on one or more of the following: conceptual level subject content codes; the presence or absence of various proper nouns, including personal names, company names, countries, cities, titles, etc.; the presence or absence of various terms or phrases; the text structure of the document, such as the time frame, or the presence of various requirements such as analytic information, cause/effect dimension, predictions, etc.; the presence or absence of negated expressions; the document date or range of dates; the document source; the document author; the document language; and a similarity score criterion for the document.

A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of an information retrieval system embodying the present invention;
FIG. 2 is a more detailed block diagram of the interactions between the user and the system during text processing portion for information retrieval;
FIG. 3 is a block diagram of the document processing portion of the system;
FIG. 4 shows the document indexing structure for terms;
FIG. 5 is a block diagram of the query processing portion of the system;
FIG. 6 is a tree-form logical representation of a query statement;
FIG. 7 is an example of clustering;
FIG. 8 is a screen shot showing the general features common to most screens used in the graphic user interface (GUI);
FIGS. 9A and 9B, taken together, form a flow diagram showing the GUI-based interactive process of text retrieval;
FIG. 10 is the sign-on screen;
FIG. 11 is the query screen;
FIG. 12 is the database selection screen;
FIG. 13 is the date and/or time selection screen;
FIG. 14A is the query processing (QP) review screen;
FIG. 14B is a detail of FIG. 14A, showing the arrangement of terms in the QP review screen;
FIG. 15 is the retrieved documents view screen (document headlines in folders);
FIG. 16 is the retrieved documents view screen (foldered by subject field code);
FIG. 17 is a retrieved document view screen (summary representation of a document);
FIG. 18 is a retrieved document view screen (full text of a document);
FIG. 19 is the More Like Marked (MLM) screen showing marked documents in folders; and
FIG. 20 is the More Like Marked (MLM) screen showing results of a MLM-based query.

DESCRIPTION OF SPECIFIC EMBODIMENTS

1.0 Introduction

This application describes a computer system used for information retrieval that, through a sequence of computer and user interactions, allows the expression and clarification of complex query statements and the retrieval and display of
relevant documents using natural language processing (NLP) techniques. The system incorporates aspects described in a paper by Liddy et al. [Liddy94a]. The system is referred to in the paper as DR-LINK (Document Retrieval using Linguistic Knowledge), and will also sometimes be referred to as DR-LINK in this application.

This application is divided into two parts. In the first part, a detailed description is given of the underlying software processing that facilitates NLP-based text retrieval. In the second part, a description is given of the graphic user interface (GUI) and the sequence of interactions that occur between the software processing system and the user.

Unless otherwise stated, the term "document" should be taken to mean text, a unit of which is selected for analysis, and to include an entire document, or any portion thereof, such as a title, an abstract, or one or more clauses, sentences, or paragraphs. A document will typically be a member of a document database, referred to as a corpus, containing a large number of documents. Such a corpus can contain documents in any or all of the plurality of supported languages.

Unless otherwise stated, the term "query" should be taken to mean text that is input for the purpose of selecting a subset of documents from a document database. While most queries entered by a user tend to be short compared to most documents stored in the database, this should not be assumed. The present invention is designed to allow natural language queries.

Unless otherwise stated, the term "word" should be taken to include single words, compound words, phrases, and other multi-word constructs. Furthermore, the terms "word" and "term" are often used interchangeably. Terms and words include, for example, nouns, proper nouns, complex nominals, noun phrases, verbs, adverbs, numeric expressions, and adjectives. This includes stemmed and non-stemmed forms.

The disclosures of all articles and references, including patent documents, mentioned in this application are incorporated herein by reference as if set out in full.

1.1 System Hardware Overview

FIG. 1 is a simplified block diagram of a computer system 10 embodying the text retrieval system of the present invention. The invention is typically implemented in a client-server configuration including a server 20 and numerous clients, one of which is shown at 25. The term "server" is used in the context of the invention, where the server receives queries from (typically remote) clients, does substantially all the processing necessary to formulate responses to the queries, and provides these responses to the clients. However, server 20 may itself act in the capacity of a client when it accesses remote databases located on a database server. Furthermore, while a client-server configuration is known, the invention may be implemented as a standalone facility, in which case client 25 would be absent from the figure.

The hardware configurations are in general standard, and will be described only briefly. In accordance with known practice, server 20 includes one or more processors 30 that communicate with a number of peripheral devices via a bus subsystem 32. These peripheral devices typically include a storage subsystem 35 (memory subsystem and file storage subsystem), a set of user interface input and output devices 37, and an interface to outside networks, including the public switched telephone network. This interface is shown schematically as a "Modems and Network Interface" block 40, and is coupled to corresponding interface devices in client computers via a network connection 45.

Client 25 has the same general configuration, although typically with less storage and processing capability. Thus, while the client computer could be a terminal or a low-end personal computer, the server computer would generally need to be a high-end workstation or mainframe. Corresponding elements and subsystems in the client computer are shown with corresponding, but primed, reference numerals.

The user interface input devices typically include a keyboard and may further include a pointing device and a scanner. The pointing device may be an indirect pointing device such as a mouse, trackball, touchpad, or graphics tablet, or a direct pointing device such as a touchscreen incorporated into the display. Other types of user interface input devices, such as voice recognition systems, are also possible.

The user interface output devices typically include a printer and a display subsystem, which includes a display controller and a display device coupled to the controller. The display device may be a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), or a projection device. The display controller provides control signals to the display device and normally includes a display memory for storing the pixels that appear on the display device. The display subsystem may also provide non-visual display such as audio output.

The memory subsystem typically includes a number of memories including a main random access memory (RAM) for storage of instructions and data during program execution and a read only memory (ROM) in which fixed instructions are stored. In the case of Macintosh-compatible personal computers the ROM would include portions of the operating system; in the case of IBM-compatible personal computers, this would include the BIOS (basic input/output system).

The file storage subsystem provides persistent (non-volatile) storage for program and data files, and typically includes at least one hard disk drive and at least one floppy disk drive (with associated removable media). There may also be other devices such as a CD-ROM drive and optical drives (all with their associated removable media). Additionally, the system may include drives of the type with removable media cartridges. The removable media cartridges may, for example, be hard disk cartridges, such as those marketed by Syquest and others, and flexible disk cartridges, such as those marketed by Iomega. One or more of the drives may be located at a remote location, such as in a server on a local area network or at a site on the Internet's World Wide Web.

In this context, the term "bus subsystem" is used generically so as to include any mechanism for letting the various components and subsystems communicate with each other as intended. With the exception of the input devices and the display, the other components need not be at the same physical location. Thus, for example, portions of the file storage system could be connected via various local-area or wide-area network media, including telephone lines. Similarly, the input devices and display need not be at the same location as the processor, although it is anticipated that the present invention will most often be implemented in the context of PCs and workstations.

Bus subsystem 32 is shown schematically as a single bus, but a typical system has a number of buses such as a local bus and one or more expansion buses (e.g., ADB, SCSI, ISA, EISA, MCA, NuBus, or PCI), as well as serial and parallel ports. Network connections are usually established through a device such as a network adapter on one of these expansion
buses or a modem on a serial port. The client computer may be a desktop system or a portable system.

The user interacts with the system using user interface devices 37' (or devices 37 in a standalone system). For example, client queries are entered via a keyboard, communicated to client processor 30', and thence to modem or network interface 40' over bus subsystem 32'. The query is then communicated to server 20 via network connection 45. Similarly, results of the query are communicated from the server to the client via network connection 45 for output on one of devices 37 (say a display or a printer), or may be stored on the storage subsystem.

1.2 Text Processing (Software) Overview

The server's storage subsystem 35, as shown in FIG. 1, maintains the basic programming and data constructs that provide the functionality of the DR-LINK system. DR-LINK software is designed to (1) process text stored in digital form (documents) or entered in digital form on a computer terminal (queries) to create a database file recording the manifold contents of the text, and (2) match discrete texts (documents) to the requirements of a user's query text. DR-LINK provides rich, deep processing of text by representing and matching documents and queries at the lexical, syntactic, semantic and discourse levels, not simply by detecting the co-occurrence of words or phrases. Users of the system are able to enter queries as fully-formed sentences, with no requirement for special coding, annotation or the use of logical operators.

The system is modular and performs staged processing of documents, with each module adding a meaningful annotation to the text. For matching, a query undergoes analogous processing to determine the requirements for document matching. The system generates both conceptual and term-based representations of the documents and queries. It is convenient to refer to the collection of various representations which the system produces for each document or for each query as "the alternative representation" for that document or query. Put another way, a reference to "the alternative representation," should be taken to encompass a single representation, or any or all of the plurality of representations.

The processing modules include a set of processing engines, shown collectively in a processing engine block 50, and a query-document matcher 55. It should be understood, however, that by the time a user is entering queries into the system, the relevant document databases will have been processed and annotated, and various data files and data constructs will have been established. These are shown schematically as a "Document Database and Associated Data" block 60, referred to collectively below as the document database. An additional set of resources 65, possibly including some derived from the corpus at large, is used by the processing engines in connection with processing the documents and queries. Alternatively, documents can be processed and annotated on the fly as they arrive in real time.

User interface software 70 allows the user to interact with the system. The user interface software is responsible for accepting queries, which it provides to processing engine 50. The user interface software also provides feedback to the user regarding the system's interpretation of the query, and accepts responsive feedback from the user in order to reformulate the query. The user interface software also presents the retrieved documents as a result of the query to the user and reformats the output in response to user input. User interface software 70 is preferably implemented as a graphical user interface (GUI), and will often be referred to as the GUI.

1.3 GUI Interaction Overview

FIG. 2 is a more detailed block diagram of the text processing portion of the system, showing the nature of the interactions between the user and the system. In the figure, processing engine block 50 has been broken into document processing engines 50D, collectively referred to as the document processor, and query processing engines 50Q, collectively referred to as the query processor (QP). Each has its own resources, shown as document processor resources 65D and query processor resources 65Q. It should be understood that some of the resources can be shared resources.

GUI 70 is shown as a single block with inputs and outputs, as well as links to matcher 55, QP 50Q, and an additional module 77, "More Like Marked" (MLM). As well as providing exceptionally rich and powerful document and query representations, user interface enhancements allow the user to interact with the retrieval process.

Documents are shown as being input to document processor 50D, which outputs a set of tagged documents 72 and a document index file 75, which stores alternative representations of the documents for use by matcher 55. Similarly, queries are shown as being input to GUI 70, and communicated to query processor 50Q, which generates an alternative representation of the query for use by the matcher. As noted above, and will be described in detail below, the alternative representation for a document or a query typically includes several different types of information that the system has generated based on the content of the document or query.

Matcher 55 executes the query by comparing the query representation to the document representations, and provides results to GUI 70 for display and other action. However, before the query representation is sent to the matcher, results of the query processing (indicating the query representation) are displayed for the user. This provides the user an opportunity to provide input specifying modification of the query representation. This user feedback is shown schematically as a semi-elliptical arrow in the figure. QP 50Q modifies the query representation accordingly before sending the query representation to matcher 55.

Once the query, possibly modified, is executed, the search results are displayed to the user. The user is then able to provide feedback to the system by marking documents that are considered particularly relevant. The representations of these documents are then used by MLM module 77 to create a further revised query for execution. This feedback based on document relevance is referred to as relevance feedback.

2.0 Document Processing

2.1 Document Processing Overview

FIG. 3 is a block diagram showing the document processing modules within document processor 50D, and some associated resources. The set of modules that perform the processing to generate the conceptual representation and the term-based representation of each document includes:

a preprocessor 80;

a part of speech (POS) tagger 90, with its associated POS, end of sentence detection, and lexical clue databases;

a subject field coder (SFC) 100, with its associated concept category database containing a hierarchy of concept categories for all words, domain knowledge concept category correlation matrix database used to disambiguate concept categories at the domain level, and global knowledge concept category sense-frequency database used to disambiguate concept categories at the global level;

a proper noun (PN) categorizer (PNC) 110, with its associated proper noun bracketer database used to
bracket PNs with embedded conjunctions and/or prepositions, proper noun category databases used to categorize PNs, proper noun prefix/suffix database used to identify PN categories by reference to the suffix or prefix, and proper noun clarification database which presents alternative proper names based on what user has typed in the query;

a complex nominal (CN) detector 120;

a single term detector 130, with its associated numeric information database used to identify and catalog numeric data types (currency, temperature, etc.);

a text structurer 140, with its associated text structure evidence phrase database used to gather evidence for a particular text structure; and

a term indexer 150.

In the course of operation, SFC 100 and term indexer 150 write document information into database index file 75, which as mentioned above, is used for query matching.

2.2 Document Preprocessor 80

Document preprocessor 80 transforms raw digital data files of text into a uniform format suitable for further processing by the DR-LINK system. Preprocessing involves some discourse-level manipulation of text, such as the explicit decomposition of composite documents into appropriate sub-texts. All text is annotated with pseudo-SGML tags. Preprocessing tags include, but are not limited to, fields such as <caption>, <date>, <headline>, <sub-text headline>, and <sub-text>, <Fig.> and <table>. The preprocessor further identifies various fields, clauses, parts-of-speech and punctuation in a text, and annotates a document with identifying tags for these units. The identification process occurs at the sentence, paragraph and discourse levels and is a fundamental precursor to later natural language processing and document-query matching.

2.3 Part-of-Speech (POS) Tagger 90

In a current implementation, documents are first processed using a custom End-of-Sentence detection program, followed by a commercial off-the-shelf (COTS) probabilistic part-of-speech (POS) tagger of the type provided by such companies as Inso Corporation, Boston, Mass. The POS

Each information bearing word in a text is looked up in the online, lexical resource. If the word is in the lexicon, it is assigned a single, unambiguous subject code using, if necessary, a process of disambiguation. Once each content-bearing word in a text has been assigned a single SFC, the frequencies of the codes for all words in the document are combined to produce a fixed length, subject-based vector representation of the document's contents. This relatively high-level, conceptual representation of documents and queries is an important representation of texts used for later matching and ranking.

Polysemy (the ability of a word to have multiple meanings) is a significant problem in information retrieval. Since words in the English language have, on average, about [illegible] senses, with the most commonly occurring nouns having an average of 7.3 senses, and the most commonly occurring verbs having an average of 12.4 senses [Gentner81], a process of disambiguation is involved in assigning a single subject field code to a word.

Words with multiple meanings (and hence multiple possible subject field code assignments) are disambiguated to a single subject field code using three evidence sources (this method of disambiguation has general application in other text processing modules to help improve performance):

2.4.1 Local Context. If a word in a sentence has a single subject code tag, it is Unique. If there are any subject codes that have been assigned to more than a pre-determined number of words in a sentence, then the codes are Frequent Codes. These two types of codes are used as anchors to disambiguate the remaining words in a sentence that share the same codes.

2.4.2 Domain Knowledge. Certain subject codes are highly correlated with other codes within a given domain. This strong association is used to disambiguate polysemous words that cannot be disambiguated using local context.

2.4.3 Global Knowledge. If words cannot be disambiguated in steps 1 or 2, then the most frequently used sense of a word is invoked.

The fixed-length vector representation of the subject contents of a text is stored in database index file 75 along

tagger identifies over 47 grammatical forms and punctuation 40 with other index representation of the text, 

marks. In addition, hyphenated words are often given mul- 2.5 Proper Noun Detector and Categorizer (PNC) 110 

tiple tags — each constituent word is given a tag, and the Proper nouns, group proper nouns (e.g., the Far East) and 

whole hyphenated phrase is given a tag. The preferred group common nouns (e.g., anti-cancer drugs) are recog- 

implementation performs additional processing of text, nized as important sources of information for detecting 

numerals, and other markings and attributes beyond that of 45 relevant documents in information retrieval [Paik93a]. PNC 

the commercial POS tagger (see discussion of additional 110 first locates the boundaries of proper noun phrases using 

modules below). the POS tags mentioned earlier, and other text analysis tools. 

2.4 Subject Field Coder (SFC) 100 Heuristics developed through corpus analysis are applied to 
Using the text output from the POS tagger, the SFC 100 bracket proper noun phrases which contain embedded con- 
tags content-bearing words in a text with a disambiguated 50 junctions and prepositions (e.g., Department of Defense, 
subject code using an online lexical resource of words Centers for Disease Control and Prevention). 

whose senses are grouped in subject categories. This is PN categorization is the process whereby a proper noun is 

described in detail in copending patent application Ser. No. assigned to a single category rather like the concept catego- 

08/135,815, filed Oct. 12, 1993, entitled "Natural Language ries used in SFC 100. Categories include city, state, country, 

Processing System For Semantic Vector Representation 55 company, person, etc. The current DR-LINK proper noun 

Which Accounts For Lexical Ambiguity," to Elizabeth D. classification scheme is expanded and modified from earlier 

liddy, Woojin Paik, and Edmund Szu-Li Yu. The application attempts, and over 40 concept categories, which in tests 

mentioned immediately above, hereinafter referred to as correctly account for over 89% of all proper nouns, with the 

"Natural Language Processing," is hereby incorporated by remainder being classified as "miscellaneous." This inven- 

reference for all purposes. 60 tion is not dependent on a specific number of concept 

A subject field code indicates the conceptual-level sense categories or a specific arrangement of categories, 

or meaning of a word or phrase. The current implementation, The proper noun classification scheme is based on algo- 

with 680 hierarchically arranged sub-categories, offers suf- rithmic machine-aided corpus analysis. In a specific 

ficient resolution without too much diffusion of the codes. implementation, the classification is hierarchical^ consisting 

The present invention, however, is not limited to a specific 65 of branch nodes and terminal nodes, but this particular 

hierarchical arrangement or a certain number of subject field hierarchical arrangement of codes is but one of many 

codes. arrangements mat would be suitable. 
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Table 1 shows a representative set of proper noun concept 
categories and subcategories. 

TABLE 1

Proper Noun Categories and Subcategories

Geographic Entity:              Human:
  City                            Person
  Port                            Title
  Airport                       Document:
  Island                          Periodicals/Books
  County                          Treaties/Laws/Acts
  Province                      Equipment:
  Country                         Software
  Continent                       Hardware
  Region                          Machines
  Water                         Scientific:
  Geographic Miscellaneous        Disease
Affiliation:                      Drugs
  Religion                        Chemicals
  Nationality                     Organic Matter
Organization:                   Temporal:
  Company                         Date
  Company Type                    Time
  Sports Franchise                Holiday
  Government                    Miscellaneous:
  U.S. Government                 Miscellaneous
  Education/Arts Services
  Political Organization
  Religious Organization

Classification occurs in the following sequence:

2.5.1 Proper noun suffixes, prefixes and infixes (e.g., 
Hospital, Senator, Professor) are examined for possible 
categorization information. 

2.5.2 The proper noun is passed to a database to determine 
if an alternative, standard form exists (e.g., President Bill 
Clinton for Bill Clinton). If the proper noun is an alias, the 
standard form is used for categorization. 

2.5.3 The proper noun is next run through context heu- 
ristic tests for possible categorization. Text-based clues are 
used for categorization. For example, if the proper noun is 
immediately followed by a comma and a state, county, or 
country name, then the proper noun is identified as a town, 
city or other geographic entity. Appositional phrases (noun 
phrases found in close proximity to proper nouns, usually 
revealing identifying information about the proper named 
entity) will also be detected and used in the categorization 
process. Numerous other heuristics are applied until the 
proper noun has been tested for inclusion in one of several 
categories. 

2.5.4 Proper names are compared to a database of significant
personal first names for a possible match (e.g., such
as a database collection of names in electronic phone
directories, sorted by frequency, or by the proper nouns
found in the databases searched). An array of knowledge
databases are used. New names and associations are
constantly added and updated.

2.5.5 Those proper nouns that remain uncategorized are
assigned to the "miscellaneous" category: in tests fewer than
11% of proper nouns are so assigned [Paik93a], [Paik93b].

2.5.6 Once identified, proper nouns can be expanded to
include other synonymous proper nouns. For group proper
nouns (Europe, Fortune 500 companies), the group proper
noun is expanded to include all member proper nouns (e.g.,
Germany and France, IBM and General Electric).

2.6 Complex Nominal (CN) Detector 120
Complex nominals (e.g., budget amendment bill, central
nervous system) are important information-bearing phrases
detected by the DR-LINK system and used in the
document-query matching process. CN phrases are recognizable as
adjacent noun pairs or sequences of non-predicating and
predicating adjective(s) and noun(s). These pairs or
sequences can be recognized from the output of the
POS-tagged text in conjunction with various unique processing
tools developed from corpus analysis. In addition, CN
phrases are recombined, or parsed, whereby meaningful
complex nominal word combinations are extracted and
indexed. For example, the CN "Information Retrieval
System" would be recombined to yield "Information Retrieval,"
"Retrieval System," and "Information System." A
synonymous phrase might be "Text Processing Software." Later
matching algorithms weight these terms based on the
assumption that a whole CN is a better, more specific
indicator of the document's contents than the recombined
constituent words.

2.7 Single Term Detector 130

The detection of CNs and PNs alone would not account
for all of the information-rich content of typical
English-language texts. Some nouns, conflated nouns (e.g. inkwell),
verbs, adverbs and adjectives also contain important
information about the subject-contents of documents, and are
detected by the single term detector. Numbers and
numerically-related information (e.g., "$" and other
currency symbols) are also recognized.

2.8 Text Structurer 140

Text structurer 140 provides valuable information about
the sense and meaning of a text [Liddy94c]. The text
structurer is based on discourse theory [VanDijk88] which
suggests that textual communication within a given
community (journalism, law, medicine), or text of a certain genre
(recipe, obituary, folk-tale) has a predictable schema. The
schema serves as a reliable indication of how and where
certain information endemic to a text-type will be displayed.
The text structurer module produces an enriched
representation of each text by computationally decomposing it into
smaller, conceptually labeled components. The delineation
of the discourse-level organization of document and query
contents facilitates retrieval of those documents that convey
the appropriate discourse semantics. For example, a query
that displays an interest in evaluative information on a topic
will be matched to documents based partly on the prevalence
of evaluative comments on that topic within those
documents.

Discourse theory and text structurer are founded in the
observation that writers who repeatedly produce texts of a
particular type are influenced by a somewhat rigid schema of
the text type. That is, they consider not only the specific
content they wish to convey but also various structural
requirements (for a discussion of discourse theory and
principles behind text structurer, see [Liddy93]).

In the current and preferred embodiment of text structurer,
a departure from earlier implementations, various structural
annotations (tags) are assigned based upon various evidence
sources, including the presence and logical arrangement of
clauses, phrases and combinations of words and punctuation.
These structural tags express important aspects which
can contribute to relevancy in a text, including time, opinion,
and intention. The text structurer assigns these annotations
or tags on the basis of (1) lexical clues or other linguistic
evidence learned from a corpus of text, which now
comprises a special lexicon, and (2) a regression formula that
includes multiple evidence sources at the word, sentence,
paragraph and document levels. For example, with newspaper
discourse, the text structurer is able to annotate several
components of information in a text, including factual
information, analysis, and cause-and-effect.

In the current instantiation, text structurer treats queries as
a unique discourse genre, and processing of queries is
different from the processing of documents. In the general case,
text structurer can be modified to accommodate many dif- 
ferent discourse types, including newspaper texts, patent 
applications, legal opinions, scientific journal articles, and 
the like, each of which exhibits internally consistent dis- 
course schemata. Different text processing can be applied to 
each. A discourse type or genre can be detected according to 
source information, author information, or other evidence. 

The text structurer provides (1) temporal information 
about a text (past, present and future), and (2) information 
about the meta-contents or intention of the text (whether the 
text contains analysis, cause/effect information, predictions, 
quotations, or evaluative comments). Dependencies can 
exist between the assignment of temporal and intention- 
based tags. 

Table 2 below shows the text structure tags used in a 
preferred implementation of text structurer 140. 

TABLE 2

Text Structure Tags

Tag  Description                      Examples of Evidence Phrases

AN   Analysis or opinion of a         Advantages; Disadvantages; In
     person, action or event          anticipation of; Pro; Con.
CE   Cause and/or Effect Noted.       As a means of, Gives rise to;
                                      Designed to; Affects, Impacts;
                                      Repercussions.
CR   Credential.                      Officer; Chief; Credential; Duties;
                                      Title.
ED   Editorial.                       Editorial.
FA   Factual Information.             Number of, How many; The date
                                      of the highest; The least
FU   An Action or Event that          Looking ahead to; Coming months;
     takes place in the Future.       Emerging; Expected; Trends.
HL   Headline.                        (From Text Preprocessing)
IN   Instructions.                    Instructions; Directions; Method
                                      for; Ingredients; Steps in
                                      the process.
LP   Lead Paragraph.                  First paragraph of a document
OB   Obituary.                        Obituary, Death notice; Died
                                      today.
OG   An Action or Event is            Over the months; Continuing;
     Ongoing in the Present           Daily, Trends.
PA   An Action or Event that took     In the last few years; In the past;
     place in the Past (1 yr or       History; Ancient
     more).
PR   An Action or Event that took     Past few months; 1st Quarter;
     place in the Recent Past         Prior month; Recently.
     (one week to 1 yr. ago).
QU   A Direct or Indirect Quote.      Statements by; Announces;
                                      Quoting; Testified.
ST   A Reference to Stock, Bond       Dow opened; Nikkei closed; Stock
     or other Financial               reports; Dividends, NYSE.
     Information.
RV   Reviews of a Product,            Standards and specifications;
     Service, or other entity.        Evaluate; Review; Test



This list is not exhaustive and not all tags are necessary. The
table also shows various sample evidence phrases used to
help identify possible tag positions and assignments. For
example, the "AN" (analysis or opinion) tag uses evidence
phrases such as "advantage," "disadvantage," and "in
anticipation of," along with other lexical and grammatical clues.

2.8.1 Assigning Tags.

In the general case, documents are tagged at the sentence
level, with indexed annotations indicating the paragraph,
position in paragraph, length of sentence, and number of
paragraphs associated with the target text and the sentence
concerned. A given sentence can have multiple tags. Terms
in a given sentence are tagged according to all tags for that
sentence, as described below in the text structurer module
description. In the preferred implementation, tags are
assigned to a sentence as follows, using a two-step process:

2.8.1.1 Document Aspects Vocabulary (DAV) database.
The first step in assigning sentence-level tags to a document
is to look up various identified evidence phrases (words,
phrases, clauses, or collections of words and punctuation) in
a Document Aspects Vocabulary (DAV) database. The DAV
database contains a collection of evidence phrases (which
can be phrases, clauses, sequences of words and
punctuation, or a single word) that, taken alone or in a
logically arranged sequence, suggest various intentions or
temporal information in text.

2.8.1.2 Aspects Probability Matrix (APM) database. In the
second stage, tag scores are assigned to various evidence
phrases according to probability scores assigned to a matrix
of all identified evidence phrases. Based on an extensive
corpus analysis of documents typical of a given discourse
type, evidence phrases are assigned probability scores for
any and all text structurer tags, based on the probability of
that evidence phrase being included within a given text
structurer.

Table 3 shows an example of the database structure for the
APM database.

TABLE 3

Structure for Aspects Probability Matrix (APM) database

Evidence Phrase   AN Tag   EV Tag   CE Tag   FA Tag

Phrase #1         0.811    0.100    0.000    0.005
Phrase #2         0.100    0.144    0.337    0.107
Phrase #3         0.000    0.000    0.567    0.122




The table shows the matrix of evidence phrases and text 
structurer tags, with each cell in the matrix containing the 
given probability value. A probability value is calculated 
based on the number of occurrences of that evidence phrase 
within a given text structure component, as a fraction of all 
occurrences of that evidence phrase in the test corpus. The 
probability values are normalized to account for the different 
distributions of the various text structure tags in the training 
data. 
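
The probability calculation described above can be illustrated with a minimal sketch. It assumes the tagged corpus is available as (evidence phrase, tag) pairs; the function name `build_apm` and that input shape are hypothetical, and the normalization step for differing tag distributions is left out because its formula is not given here.

```python
from collections import Counter, defaultdict


def build_apm(observations):
    """Build a phrase-by-tag probability matrix.

    observations: iterable of (evidence_phrase, tag) pairs, one pair per
    occurrence of the phrase inside a component carrying that tag
    (a hypothetical input format).
    """
    counts = defaultdict(Counter)
    for phrase, tag in observations:
        counts[phrase][tag] += 1

    apm = {}
    for phrase, tag_counts in counts.items():
        total = sum(tag_counts.values())  # all occurrences of the phrase
        # Fraction of the phrase's occurrences that fall under each tag.
        apm[phrase] = {tag: n / total for tag, n in tag_counts.items()}
    return apm


corpus = [("in anticipation of", "AN")] * 3 + [("in anticipation of", "FU")]
matrix = build_apm(corpus)
```

With this toy input, three of the four occurrences fall under AN, so the AN cell for the phrase is 0.75 and the FU cell is 0.25.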

In addition to text structurer tags assigned using DAV 
database evidence, the following method is used to assign 
text structurer tags at the sentence or clause level using the 
APM database and other evidence. This method is as fol- 
lows: 

2.8.1.3 All evidence phrases in a sentence are analyzed
using the APM database for all text structurer tags. A
summed score for all tags is produced using a
Dempster-Shafer formula, or similar formula. This score is used as an
independent variable in a logistic regression equation
described below.

(a) The same score as above is generated, except the
summation does not use a Dempster-Shafer formula.

(b) The following evidence sources are calculated:
number of words in the sentence under consideration; the
number of paragraphs in the document under
consideration; the number of sentences in the paragraph under
consideration; the relative position of the sentence with
reference to the paragraph under consideration; and the
relative position of the sentence in regard to the
document under consideration.

(c) The evidence sources in the first three paragraphs
above are used in a logistic regression equation, least
squares fit, separately for each of the tag assignments.
Coefficients for each of the 7 terms in the regression
formula are computed using training data for specific
discourse types or genres. The output at this stage of
processing is a score, normalized between 0 and 1, for
each of the text structurer tags. The score represents the
likelihood that a given text structurer tag should be
assigned to a given sentence in a document.

(d) For each text structurer tag, a minimum threshold
value is assigned for the logistic regression, such that a
tag is only assigned to a sentence (or clause) if the
regression value exceeds this pre-determined threshold
value. The threshold value for each tag is calculated
based on extensive corpus analysis using training data. 
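
The scoring steps above can be sketched as follows. This is an illustration under stated assumptions, not the patented formula: the Dempster-Shafer step is approximated by the standard combination of simple support values for a single hypothesis, 1 - prod(1 - p), and the coefficients, intercept and thresholds are hypothetical placeholders for values that would be fitted on training data.

```python
import math


def ds_combine(probs):
    """Combine per-phrase support for one tag as 1 - prod(1 - p).

    Assumed stand-in for the Dempster-Shafer summation.
    """
    remaining = 1.0
    for p in probs:
        remaining *= 1.0 - p
    return 1.0 - remaining


def tag_score(probs, features, coeffs, intercept):
    """Logistic score in (0, 1) for one tag and one sentence.

    probs:    APM values of the sentence's evidence phrases for this tag
    features: the five length and position evidence sources from (b)
    coeffs:   seven hypothetical fitted coefficients, one per term
    """
    terms = [ds_combine(probs), sum(probs)] + list(features)
    if len(terms) != len(coeffs):
        raise ValueError("expected one coefficient per regression term")
    z = intercept + sum(c * t for c, t in zip(coeffs, terms))
    return 1.0 / (1.0 + math.exp(-z))


def assign_tags(scores, thresholds):
    """Keep only the tags whose score exceeds the per-tag threshold."""
    return [tag for tag, s in scores.items() if s > thresholds[tag]]


combined = ds_combine([0.5, 0.5])
neutral = tag_score([0.5, 0.5], [10, 4, 3, 0.5, 0.25], [0.0] * 7, 0.0)
chosen = assign_tags({"AN": 0.8, "FA": 0.2}, {"AN": 0.5, "FA": 0.5})
```

The seven regression terms here are the two summed scores plus the five evidence sources listed in (b), which matches the count of 7 stated in (c).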

2.8.2 Generating Tags 

In the preferred embodiment two methods are used to 
generate evidence phrases in texts (queries and documents). 
In the first method, the natural language processing abilities
of the DR-LINK system are exploited to automatically
detect information-bearing words, phrases or clauses. For
example, POS tagger 90, PNC 110, and related processing
elements of the DR-LINK system are automatically able to
detect appositional phrases related to proper nouns, and
these are used as evidence phrases. In the second method, all
single words, adjacent word pairs, word triples, etc. are
extracted in overlapping sequence from sentences and used
as the units of analysis for constructing evidence phrases. 
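
The second method amounts to extracting overlapping word n-grams. A minimal sketch, assuming whitespace tokenization and a cap of three words per unit (both choices are illustrative; the name `candidate_phrases` is hypothetical):

```python
def candidate_phrases(sentence, max_n=3):
    """Return all overlapping word sequences of length 1..max_n."""
    tokens = sentence.split()
    phrases = []
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the sentence.
        for start in range(len(tokens) - n + 1):
            phrases.append(" ".join(tokens[start:start + n]))
    return phrases


units = candidate_phrases("in anticipation of")
```

Each extracted unit could then be looked up in an evidence phrase database such as the DAV database described earlier.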

2.8.3 Indexing Text Structure
The text structure tags are automatically incorporated into 

the index representation of terms in a document. 

Table 4 shows a specific implementation of the index term 
format. 

TABLE 4

1. Index Term           2.    3.     4.   5.      6.    7.

Document retrieval      25    3425   2    FA/QU   1/9   525
Information retrieval   19    3425   1    FA      1     131



As can be seen in the table, there are 7 fields in each term
record, consisting of (1) an index term, (2) the number of
documents in the database which have the index term, (3) the
identification of the document in which the index term occurs,
(4) the number of occurrences of the index term in the document,
(5) the text structurer tags which are assigned to the sentences
in which the index term occurs, (6) the identification of the
logical paragraphs in which the index term occurs, and (7) the
total number of indexed terms in the document.
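
As an illustration of the seven-field record, the sketch below models one row of Table 4 as a data class. The class and field names are hypothetical, and reading "/" as a separator between multiple tags or paragraph numbers is an assumption drawn from the sample rows.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TermRecord:
    term: str                  # field 1: index term
    doc_frequency: int         # field 2: documents containing the term
    doc_id: int                # field 3: document identification
    occurrences: int           # field 4: occurrences in this document
    structure_tags: List[str]  # field 5: text structurer tags
    paragraphs: List[int]      # field 6: logical paragraph identifiers
    doc_term_count: int        # field 7: indexed terms in the document


def parse_record(fields):
    """Convert seven string fields into a TermRecord."""
    term, df, doc_id, occ, tags, paras, total = fields
    return TermRecord(
        term=term,
        doc_frequency=int(df),
        doc_id=int(doc_id),
        occurrences=int(occ),
        structure_tags=tags.split("/"),
        paragraphs=[int(p) for p in paras.split("/")],
        doc_term_count=int(total),
    )


record = parse_record(
    ["Document retrieval", "25", "3425", "2", "FA/QU", "1/9", "525"]
)
```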

The text structurer is used as a partial requirement for
relevancy in the matching process. Stated briefly, in the
query to document matching process, each query term is
searched against document index terms. One of the metrics
used for assigning relevance scores, called positive text
structurer (PTS), requires that a match be based on the
presence of query terms found within the correct text
structurer component. More details on PTS-based matching
are given in the later description of the matcher.

2.9 Term Indexer 150 

Term indexer 150 indexes terms and SFC 100 indexes 
SFC vector data in related files, shown collectively as index 
file 75. Other document-based indexing is possible. The
term index is a two-tier inverted file. The first level of the file
contains terms, where a term can be a word (single term), a
complex nominal, or a proper noun. The second level of the
file contains postings (document references) with associated 
scores. The scores are an indication of the strength of the 
association between the term and the document. A single 
term will usually map to numerous postings, each with a 
score, as shown in FIG. 4. Terms are also indexed with 
reference to their location within the text (both as logical 
paragraphs and regarding text structure). 

Indexing involves extracting terms from the text, check- 
ing for stop words, processing hyphenated words, then 
stemming all inflected terms to a standard form. Finally, for 
each document the within-document Term Frequency (TF) is
calculated; the product of TF and the Inverse Document 
Frequency (IDF) is used as the basis for the postings 
score — a measure of the relative prominence of a term 
compared to its occurrence throughout the corpora. TF.IDF 
scores are also cataloged for a varying number of logical 
paragraphs in a given document. 
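
A minimal sketch of a two-tier inverted file with TF.IDF posting scores follows. It assumes terms have already been extracted, stop-word filtered and stemmed, and it uses the common IDF form log(N / df); the exact IDF formula is not stated in this description, so that choice and the name `build_index` are assumptions.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to a list of (doc_id, score) postings.

    docs: dict of doc_id -> list of preprocessed terms.
    """
    term_freqs = {doc_id: Counter(terms) for doc_id, terms in docs.items()}

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for counts in term_freqs.values():
        doc_freq.update(counts.keys())

    n_docs = len(docs)
    index = defaultdict(list)
    for doc_id, counts in term_freqs.items():
        for term, tf in counts.items():
            idf = math.log(n_docs / doc_freq[term])  # assumed IDF form
            index[term].append((doc_id, tf * idf))
    return dict(index)


index = build_index({1: ["retrieval", "system", "retrieval"], 2: ["system"]})
```

In this toy corpus "system" occurs in every document, so its postings score zero, while "retrieval" occurs twice in one of two documents and scores 2 * log(2).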

A logical paragraph is a subsection of a complete 
document, which may contain one or several text 
paragraphs, depending on the length of the overall docu- 
ment. Documents are divided into logical paragraphs based 
on size and natural transitions in a text, such as paragraph 
boundaries or subhead boundaries. Later matching can occur 
within a logical paragraph, so as to highlight the most 
relevant logical paragraph or the portion of a long document 
deemed most relevant to a query. While the preferred 
implementation uses the 16-unit logical paragraph arrange- 
ment described above, alternative implementations are pos- 
sible. One such implementation is to divide the document 
into an unrestricted number of subsections that correspond 
to each and all of the natural paragraph boundaries in a text. 
3.0 Query Processing 

3.1 Query Processing Overview 

FIG. 5 is a block diagram showing the query processing 
modules within query processor (QP) 50Q. Queries are 
processed in a different manner to documents, although the 
evidence extracted from query text is very similar to the 
evidence extracted from document texts, and therefore some 
of the modules perform the same type of processing. The set 
of modules that perform the processing to generate the 
conceptual representation and the term-based representation 
of each query includes: 

a preprocessor 160; 

a meta-phrase identifier 165, with its associated meta- 
phrase evidence database used to identify meta-phrases 
in the query; 

a proper noun (PN) categorizer (PNC) 170; 

a text structure requirement identifier 180, with its asso- 
ciated text structure requirement database (similar to 
the text structure evidence phrase database, but for 
queries); 

a complex nominal (CN) detector 190;

a PN expander 200, with its associated PN expansion
database used to find synonymous expansions for
stated PNs;

a PN clarifier 210, with its associated PN clarification 
database; 

a CN expander 220, with its associated CN expansion 
database used to find synonymous expansions for 
stated CNs; 

a sublanguage processor 230, with its associated sublan- 
guage processing database used to identify the logical 
form of the query; 

a negation identifier 232, with its associated negation 
database used to identify negative portions in the query; 
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a mandatory term identifier 235, with its associated man- proper nouns contained in the query. For example, if the user 

datory term database used to identify mandatory has asked for information about "Far East countries," the 

requirements stated in the query; proper noun expander offers the user the fofiowing member 

a weighted Boolean processor 240; and meronyms for "Far East": Japan, South Korea, North Korea, 

a subject field coder (SFQ 250. 5 Taiwan, and China. The user can decide whether or not to 

3.2 Query Preprocessor 160 "« these expanded terms in the query. 

Query preprocessor 160 performs the same tasks for These expansion terms are entered into the proper noun 

queries as preprocessor 80 performs for documents. expansion database by analyzing the corpus to find proper 

3.3 Meta-Phrase Identifier 165 nouns which are related by the above exemplified semantic 
Meta-phrase identifier 165 performs the task of detecting 10 re5a ticms. In addition, proper noun expansion database 

words or phrases in a query that are used to state (or cntnes caa bc cntercd manually using existing reference 

expand upon) the query. For example, if a user asked: sources. 

"I would like information about space ships, or any 3 8 ^I** Noun ( PN ) Clanfier 210 

materials on lunar shuttles," using the meta lexicon, the PN clanfier 210 automatically provides the system user 

phrases "I would like information about" and "any c ^ ambiguous interpretations for proper nouns contained 

materials on" would be tagged as meta phrasing using m ±G V* 1 * For example, if the user has asked for infor- 

an SGML marker. These words and phrases are then mation about "Clinton," the proper name clarifier offers the 

removed from the query processing stream, and are not user the following possible interpretations for "Clinton": 

used as search terms William Clinton, Hillary Clinton, David Clinton, and Robert 

3.4 Proper Noun Categorizer (PNC) 170 Clinton. The user can decide or clarify whether or not to use 
PNC 170 performs the same task for queries as PNC 110 20 certain interpretations of the proper nouns in the query. 

does for documents. These clarifiable terms are entered into the proper noun 

3.5 Text-Structure Requirement Identifier 180 clarification database by automatically or manually creating 
Text-structure requirement identifier 180 performs a simi- possible variants of the proper nouns in the corpus and then 

lar analysis of queries as text structurer 140 performs of creating a mapping table which consists of pairs of variants 

documents. However, while the text structurer operates at 25 and the P ro P er noun - tne above example shows the 

the sentence level or on clause level, in the preferred names of the people who all share the same last name. Thus, 

embodiment, the text-structure requirement identifier oper- ^ tcrm ' "Clinton" newk ^ clarified, 

ates upon the whole query. That is, the whole query is " Cbmpfex Nominal (CN) Expander 220 

categorized by tense requirement (past, present future), and CN e f x P a ° der 220 P^*f tl Jf ***** T' 

by intention requirement (prediction, analysis, facts, etc.). 30 "5™ f °! P hrase f. * ^ ^ i** ™F* 

' , / j- • • j f .L it. t j asked for information about wealthy individuals emigrating 

IMs an understandmg is gained of the overall temporal and from £ knd/ , ^ com kx nominal der ^ ^ 

discourse aspect requirements of the query. An alternative user the following synonyms for "wealthy individual- rich 

implementation would assign tags at the individual sentence persoDj wealthy pBao% rfch ^ / ffluent individuaU 

or clause level. anc j afg ucn t person. ITie user can decided whether or not to 

Text-structure requirement identifier codes are identical to 35 use these synonyms in the query, 
the codes used in the text structurer. Similar heuristics are To generate these synonyms, two methods are used. First, 
used to place text-structure requirement identifier codes, a CN database having a list of CN synonyms, based on 
although variant lexical and discourse-level clues are corpus frequency of particular complex nominals, is con- 
employed in keeping with the variant structure of query
statements. Codes are not mutually exclusive; any combination of
requirements can be assigned to the same query.

For queries, the assignment of the text structurer tags is made
using an extensive Question Aspect Vocabulary (QAB) database. The
QAB database contains a collection of evidence phrases (which can
be phrases, clauses, sequences of words and punctuation, or a
single word) that, taken alone or in a logically arranged
sequence, suggest various intentions or temporal information in
text.

A complex of clues are used to establish tag assignments. The
assignment of a tag may be based on the presence of a single
evidence clue (single words, phrases, clauses, or sequences of
words and punctuation) in a query, or upon a collection of such
clues in a query. Alternatively, tag assignment may be based on
the logical arrangement of evidence clues found in the QAB
database, whereby evidence clues must appear (1) in a specified
sequence, (2) connected logically, using operators such as AND,
OR or NOT, or a combination of (1) and (2). In the preferred
implementation, if no text structurer tag can be assigned using
the QAB database evidence clues, then the default tag is Lead
Paragraph (LP).

3.6 Complex Nominal (CN) Detector 190
The CN detection techniques for queries are the same as those
used by CN detector 120 for documents.

3.7 Proper Noun (PN) Expander 200
PN expander 200 automatically provides the system user with
synonyms, hyponyms, or member meronyms for

suited. If there is a match here, the synonyms from this CN
database are used. If there is no match in the database, then
automatic word substitution for each word in the CN is performed
using an online single term CN database. Possible synonymous
phrasings generated by this method are checked against corpora
indices to confirm that the new construction does occur in some
index. If the phrase does not occur in any index, it will be
removed from the list of synonyms to be presented to the user.

3.10 Sublanguage Processor 230

Sublanguage processing is the beginning of a transition from a
natural language query representation to a pseudo-logical
representation of the query contents. In the preferred embodiment
this is another heuristic system, but other approaches may be
taken. The initial sublanguage processing of the query involves
tokenization, standardization and the resolution of anaphoric
references.

Part of this sublanguage is a limited anaphor resolution (that
is, the recognition of a grammatical substitute, such as a
pronoun or pro-verb, that refers back to a preceding word or
group of words). An example of a simple anaphoric reference is
shown below:

"I am interested in the stock market performance of IBM. I am
also interested in the company's largest foreign shareholders."

In this example, the phrase "the company's" is an anaphoric
reference back to "IBM." The QP module substitutes the referent
(IBM) in anaphors before creating the logical representation.
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After this initial processing, the natural language query is 
decomposed to an ensemble of logical or pseudo-logical
assertions linking portions of the query, or various terms (or 
groups of terms). A series of operators are used to make 
these logical and pseudo-logical assertions. These operators 
relate terms and parts of the query text together, and also 
assign scores according to the formulas in Table 5 and as 
described below. Different operators assign different scores. 

TABLE 5

Operators Used for Boolean Representation

Operator  Operation         Fuzzy Weight/Score

AND       Boolean AND       Addition of scores from ANDed terms
OR        Boolean OR        Maximum score from all ORed terms
SNOT      Negation
#AND      Conditional AND   head term #AND tail term. If head term
                            present, revert to AND, else 0
*AND      Mandatory marker  query *AND mandatory. Used to
                            separate mandatory elements for
                            later foldering. Scores as AND.
-AND      Proximity AND
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First, the Query Processor (QP) automatically constructs a 
logical representation of the natural language query. The
user is not required to annotate the query in any way. A tree
structure with terms connected by logical operators is
constructed. Consider the example query below:

"I am interested in any information concerning A and B
and C, D or E and F."

The tree representation of this query is shown in FIG. 6.
Various linguistic clues such as lexical clues and punctuation
are used to determine the logical form of the query. The
basis of this system is a sublanguage grammar which is
rooted in generalizations regarding the regularities exhibited
in a large corpus of query statements.

The sublanguage relies on items such as function words
(articles, auxiliaries, and prepositions), meta-text phrases,
and punctuation (or the combination of these elements) to
recognize and extract the formal logical combination of
relevancy requirements from the query. In the very simple
query stated above, the positions and relations of the
preposition "concerning", the conjunctions "and" and "or", and
the comma and period are used together to produce the
appropriate logical relationship between the various items A
through F. The sublanguage interprets the query into
pattern-action rules which reveal the combination of relations
that organize queries, and which allow the creation from each
sentence of a first-order logic assertion, reflecting the
Boolean and other logical assertions or relations in the text.

The sublanguage processor uses the principles of text
structure analysis and models of discourse to automatically
identify conjunction, disjunction, mandatory, positive, and
negative portions of a query. The principles employed are
based on the general observation among discourse linguists
that writers are influenced by the established schema of the
text-type they produce, and not just on the specific content
they wish to convey. This established schema can be
delineated and used to computationally instantiate
discourse-level structures. In the case of the discourse genre
of queries written for online retrieval systems, empirical
evidence has established several techniques for locating the
positive, negative, disjunction, conjunction, and mandatory
aspects:

3.10.1 Lexical Clues. There exists a class of frequently
used words or phrases that, when used in a logical sequence,
establish the transition from the positive to the negative
portion of the query (or the reverse). Such a sequence might
be as simple as "I am interested in" followed by "but not".
[Clue words or phrases must have a high probability within
the confines of a particular context.]

3.10.2 Component Ordering. Components in a query tend
to occur in a certain predictable sequence, and this sequence
can be used as a clue to establish negation.

3.10.3 Continuation Clues. Especially in relatively long
queries a useful clue for the user's conjunction or
disjunction requirements across sentence boundaries is relations
which occur near the beginning of a sentence and which
have been observed in tests to predictably indicate the nature
of the logical transitions from sentence to sentence.

3.11 Negation Identifier 232

Negation detection is unique to queries. It is common for
queries to simultaneously express both items of interest and
those items that are not of interest. For example, a query
might be phrased "I am interested in A and B, but not in C."
In this instance, A and B are required (they are in the
"positive" portion of the query) and C is negated and not
required (it is in the negative portion of the query). Terms in
the positive and negative portions of the query are
considered for document matching. Terms in both portions of the
query are used for foldering assignments, while terms in the
positive portion of the query are used in calculating logistic
regression matching scores (see later discussions on
matching).
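As a rough illustration of how a lexical clue can separate the positive and negative portions of a query, the following sketch splits a query at the first negation clue it finds. It is a minimal sketch under stated assumptions, not the sublanguage grammar itself: the clue list NEGATION_CLUES and the function name split_positive_negative are hypothetical, and a real system would also weigh component ordering and continuation clues.

```python
import re
from typing import Tuple

# Hypothetical clue list; the source names only "but not" as an example.
NEGATION_CLUES = ("but not", "except", "excluding")


def split_positive_negative(query: str) -> Tuple[str, str]:
    """Split a query into (positive, negative) text at the earliest clue."""
    lowered = query.lower()
    earliest = None
    for clue in NEGATION_CLUES:
        match = re.search(r"\b" + re.escape(clue) + r"\b", lowered)
        if match and (earliest is None or match.start() < earliest.start()):
            earliest = match
    if earliest is None:
        # No clue found: the whole query is treated as positive.
        return query.strip(), ""
    positive = query[:earliest.start()].rstrip(" ,;")
    negative = query[earliest.end():].strip(" .,;")
    return positive, negative


positive, negative = split_positive_negative(
    "I am interested in A and B, but not in C."
)
print(positive)
print(negative)
```

Applied to the example query, the text before "but not" becomes the positive portion and the text after it becomes the negative portion.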
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3.12 Mandatory Term Identifier 235 

In addition to the logical assertion described above, the
query is also divided into mandatory and non-mandatory
portions by sublanguage processor 230. It is common
practice for a query to be stated such that one or more terms 
in the query are essential for relevance. For example, a query 
might be stated as follows: 

"I am only interested in documents that discuss A and B." 

Using various linguistic clues the system recognizes these 
mandated requirements, and divides the query into two 
portions using the *AND operator. In an earlier 
implementation, the *AND operator assigned no weighted 
score to terms in the mandatory or non-mandatory portion of 
the query representation, but the matching of mandatory 
terms with a document was used for later segregation (e.g., 
foldering) of relevant documents. In a current 
implementation, the mandatory portion of the query is 
incorporated into the logical tree structure of the query 
through the *AND operator at the top level. Therefore, the
tree structure of the query is <query> *AND
<mandatory_portion>.
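The top-level layout <query> *AND <mandatory_portion> might be represented as in the sketch below. The names build_mandatory_tree and score are hypothetical, as is the dictionary layout. The 0.5 budget for each portion follows the weighting stated in the discussion of the weighted Boolean processor; sharing that budget equally among ANDed terms is an assumption made only for illustration.

```python
from typing import Dict, List, Set


def build_mandatory_tree(query_terms: List[str],
                         mandatory_terms: List[str]) -> Dict:
    """Build a nested dict for <query> *AND <mandatory_portion>."""

    def portion(terms: List[str], budget: float) -> Dict:
        # Assumption: terms in a portion are ANDed and share the budget equally.
        share = budget / len(terms) if terms else 0.0
        return {"op": "AND", "terms": {term: share for term in terms}}

    return {
        "op": "*AND",
        "query": portion(query_terms, 0.5),
        "mandatory": portion(mandatory_terms, 0.5),
    }


def score(tree: Dict, matched: Set[str]) -> float:
    """*AND scores as AND: add the scores of both portions."""
    total = 0.0
    for key in ("query", "mandatory"):
        for term, weight in tree[key]["terms"].items():
            if term in matched:
                total += weight
    return total


# "I am only interested in documents that discuss A and B."
tree = build_mandatory_tree(["A", "B"], ["A", "B"])
print(score(tree, {"A"}))
print(score(tree, {"A", "B"}))
```

Because a mandatory term appears in both portions, a match on it is counted once in each, which is why such terms contribute twice to the overall score.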

3.13 Weighted Boolean Processor 240 

As noted above, FIG. 6 shows a tree structure of the query, 
and the manner in which a weighted Boolean score 
(sometimes referred to as the fuzzy Boolean score) is 
assigned for each term (PN, CN, or single term) in the 
logical query representation. The logical representation of 
the requirements of the query consists of a head operator 255,
which can be any operator, which links in a tree structure 
through nodes 257 and Boolean operators to various 
extracted query terms 260 at terminal nodes. Each term is 
assigned a possible term weight score 262. Scores are 
normalized such that the highest attainable score during 
matching (if all terms are successfully matched with a 
document) is 1.0. 
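Table 5 states that AND adds the scores of its matched terms and OR keeps the maximum score among its matched terms. The sketch below applies those two rules to a small tree. It is illustrative only: the Term and Node classes, the fuzzy_score function, and the example tree with its weights are assumptions chosen so that the maximum attainable score is about 1.0; they are not taken from FIG. 6.

```python
from dataclasses import dataclass
from typing import List, Set, Union


@dataclass
class Term:
    name: str
    weight: float  # maximum score this term can contribute


@dataclass
class Node:
    op: str  # "AND" or "OR"
    children: List[Union["Node", Term]]


def fuzzy_score(node: Union[Node, Term], matched: Set[str]) -> float:
    """Score a query tree against the set of terms matched in a document."""
    if isinstance(node, Term):
        return node.weight if node.name in matched else 0.0
    scores = [fuzzy_score(child, matched) for child in node.children]
    if node.op == "AND":
        return sum(scores)  # addition of scores from ANDed terms
    if node.op == "OR":
        return max(scores, default=0.0)  # maximum score from ORed terms
    raise ValueError(f"unsupported operator: {node.op}")


# Hypothetical tree: A AND B AND (C OR F), with illustrative weights.
example_tree = Node("AND", [
    Term("A", 0.33),
    Term("B", 0.33),
    Node("OR", [Term("C", 0.33), Term("F", 0.17)]),
])

print(round(fuzzy_score(example_tree, {"A", "C", "F"}), 2))
```

With A, C and F matched, A contributes its weight, B contributes nothing, and the OR node contributes the larger of the C and F weights.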

During matching the fuzzy logical AND operator per- 
forms an addition with all matched ANDed term scores. The 
fuzzy OR operator selects the highest weighted score from 
all the matched ORed terms. For example, in the query 
representation of FIG. 6, if terms A, C and F are matched, 
then the score assigned the match would be 0.66 (that is, 
0.33 from the match with query term A, and 0.33 from the 
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match with C, which is the higher of the ORed C and F
weighted scores). Recombinations and expansions of PNs
and CNs are assumed to be less precise representations of
the specific query requirements. Their score assignments
reflect this and are calculated to be less than that of the
specified CN or PN.

Note that the mandatory portion of the query is
automatically assigned a maximum possible weight of 0.5 with the
entire query also being assigned a maximum possible weight
of 0.5. This means that terms in the mandatory portion of the
query, if matched in the document, contribute twice to the
overall score.

3.14 Subject Field Codes (SFC) Module 250

Subject field codes are assigned to each substantive word
in the positive portion of the query. The method of
assignment and the arrangement of codes is similar to that used by
SFC 100 for document vector generation as described
above.

4.0 Document Matching and Presentation to User

4.1 Matching Overview
Matcher 55 matches documents by comparing the
documents with the query and assigning each document a
similarity score for the particular query. Documents with
sufficiently high scores are arranged in ranked order in three
folders, according to their relative relevance to the substance
of a query. There are a number of evidence sources used for
determining the similarity of documents to a query request,
including:

Complex Nominals (CNs)*
Proper Nouns (PNs)*
Subject Field Codes (SFCs)
Single Terms*
Text Structure
Presence of Negation
Mandatory requirements

* CNs, PNs, and Single Terms are collectively called
"terms."

Documents are arranged for the user based on a two-tier
ranking system. The highest-level ranking mechanism is a
system of folders. Documents are placed within folders
based on various criteria, such as the presence or absence of
mandatory terms. The lower-level ranking mechanism sorts
documents within each folder based on criteria such as
similarity score, document date assignment, etc.

The operation and function of the matcher is not
dependent on the number or specific interrelationship of the
folders, or on the within-folder mechanisms used to rank and
display documents, or on the number of evidence sources
used to compute document relevance scores as described
below.

Using the evidence sources mentioned above, the matcher
determines the similarity or suitable association between
query and documents. Foldering is based on the presence or
absence and logical relationship in a document of query
terms, negation and mandated terms. Within folders,
document rank position is computed using match scores for the
whole document and for up to 8 segments, or logical
paragraphs, that make up the document (see earlier
discussions for an explanation of logical paragraphs).

4.2 Scoring

Five sources of evidence are used to compute five
individual measures of similarity (scores) between the query and
a given document, and the five individual scores are
combined to form a single relevance score. The five sources of
evidence, normalized (where appropriate) for document
length, are:

4.2.1 Positive Quorum (PQ)

The PQ is the fuzzy Boolean score for all terms in the
positive portion of the query, computed as discussed above
in connection with weighted Boolean processor 240.

4.2.2 Positive Term (PT)

The PT is a combination of the TF.IDF scores for the
terms in the positive portion of the query. The product of
TF.IDF for a given term in a document provides a
quantitative indication of a term's relative uniqueness and
importance for matching purposes. A natural-log form of the
equation for TF.IDF, where TF is the number of occurrences
of a term within a given document, and IDF is the inverse
of the number of documents in which the term occurs,
compared to the whole corpus, as shown below:

TF.IDF = (ln(TF) + 1) . ln(N+1/n)

where N is the total number of documents in the corpus, and
n is the number of documents in which the term occurs.
The TF.IDF scores are calculated for the documents.

The way that the TF.IDF scores are combined for the PT
is in accordance with the combination of scores discussed
above in connection with weighted Boolean processor 240
(i.e., based on the structured logical representation of the
query described earlier). However, the scores for the nodes
are equal to the TF.IDF scores for the terms rather than the
normalized scores described above (maximum scores of
0.33, 0.33, 0.33, 0.17, and 0.17 for the example in FIG. 6).

4.2.3 Positive Text Structure (PTS)

The PTS is the fuzzy Boolean score for all query terms in
the positive portion of the query matched within the correct
text structure component/s. For positive text structurer, each
query term is assigned with a weight which is based on how
the query terms are organized as a logical requirement and
on the text structurer requirements extracted from the query
statement. The assignment of PTS scores is as follows:

1. IF a query term does not match with any one of the
index terms for a document, THEN no PTS score is
generated based on the query term.

2. ELSE, IF a query term matches with one of the index
terms for a document AND IF the text structurer
requirements which are assigned to the query term do not have any
common text structurer tags assigned to the index terms,
THEN no PTS score is generated based on the query term.

3. ELSE, IF there is at least one term in common between
the query and a document and that term also has the common
text structurer tag, THEN a PTS score is generated. The
score is the product of the query term weight and the number
of matching text structurer tags, divided by the total number
of text structurer tags which are assigned to the query term.

Consider the following example:

<positive>
document_retrieval |AN, FU| 0.5
&and
information_extraction |AN, FU| 0.5

where a query consists of two terms ("information retrieval"
and "information extraction"), with the query text structurer
assignments being Analytic (AN) and future (FU). The score
assignment based on PTS for the presence of each term in a
document within the correct text structurer is 0.5,
respectively.
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If there are three documents with the following index
terms (for a description of this representation of terms, see
section on text structurer):

then the PTS score for document 3425 will be:

PTS score based on document_retrieval: 0.5*1/2=0.25
PTS score based on information_extraction: 0.5*0/2=0
PTS score for 3425=0.25+0=0.25

4.2.4 Positive Paragraph Matching (PPM)
The fuzzy Boolean score for query terms in the positive
portion of the query is computed for each logical paragraph
in the document. The PPM is the largest of these scores for
that document.

4.2.5 Subject Field Code (SFC) Vector Match
For the SFC match score, first the subject vector for each
text (document or query) is normalized using a term
weighting formula in order to control for the effect of document
length. The matching score between the query and document
is determined by the correlation or association between
them, which is computed by a similarity measure that treats
the query and document vectors as two data points in the
multi-dimensional space and then computes the distance
between these two data points.

4.2.6 Combining Individual Scores
In the preferred implementation, a logistic regression
analysis using a Goodness of Fit model is applied to
combine individual scores which are described in sections 4.2.1
to 4.2.5. Thus, the individual scores act as independent
variables in the logistic regression formula. The combined
score is also referred to as the relevance score.

Other formulas can be used to combine individual scores
to generate relevance scores. Relevance scores can be
calculated using different methods using the same or similar
evidence sources. For example, it is possible to use a Nearest
Neighborhood approach [Hanson90] to ascertain which
documents match with a given query.

Five independent variables are used (the implementation
of the matcher is not dependent on the number of evidence
sources used). Regression coefficients for each variable are
calculated using an extensive, representative test corpus of
documents for which relevance assignments to a range of
queries have been established by human judges.

Using the evidence sources listed above, the logistic
probability (logprob) of a given event is calculated as
follows:

logprob(event) = 1/(1 + e^(-Z))

where Z is the linear combination:

Z = B0 + B1X1 + ... + B5X5

and B1-5 are the regression coefficients for the independent
variables X1-5. Documents are ranked by their logistic
probability values, and output with their scores.

One or more but not all independent variables can be
removed from the formula to generate relevance scores.
Furthermore, additional independent variables beyond
those described in sections 4.2.1 to 4.2.5 can be included.
For example, in the preferred implementation, an extra
independent variable, which represents the length of each
document in the database, is used. This independent variable
is a linearly transformed value which is based on the number
of words in the document in consideration.

In addition, one or more individual variables can be
transformed and normalized before the values are used in the
regression formula or any other formula which are used to
generate relevance scores. For example, it is possible to
transform every individual variable to account for the length
of the document in consideration instead of using an extra
independent variable.

4.3 Foldering

Documents are ordered within folders by their logistic
probability values, SFC values, date, or other specified
criterion. The main foldering scheme is based on various
match criteria and/or match scores. Other foldering schemes,
such as view-by subject, will be described in a later section.
The total number of documents in all folders can be selected
by the user (see later discussion). In this preferred
embodiment, three folders are used. The assignment of
documents to folders is determined as follows:

4.3.1 Folder One. All unique single terms appear in a
single logical paragraph of the document; the negative
logical requirement is not satisfied,

OR

All query terms that satisfy the logical truth of the query
(complex nominals, proper nouns, single terms, or suitable
expansions) match; the negative logical requirement is not
satisfied,

OR

All query terms (or appropriate expansions) in the
mandatory portion of the logical representation of the query match;
the negative logical requirement is not satisfied.

4.3.2 Folder Two. Documents that have scores sufficient
to pass either the user-selected cut-off for the number of
documents displayed or the system determined cut-off for
relevance, but the documents do not qualify for Folders One
or Three.

4.3.3 Folder Three. All unique single terms appear in a
single logical paragraph of the document; the negative
logical requirement is satisfied,

OR

All query terms that satisfy the logical truth of the query
(complex nominals, proper nouns, single terms, or suitable
expansions) match; the negative logical requirement is
satisfied,

OR

All query terms (or appropriate expansions) in the
mandatory portion of the logical representation of the query match;
the negative logical requirement is satisfied.

4.4 Retrieval Criteria and Recall Predictor

The matching of documents to a query organizes
documents by matching scores in a ranked list. The total number
of presented documents can be selected by the user, the
system can determine a number using the Recall Predictor
(RP) function, or, in the absence of user input, the system
will retrieve all documents with a non-zero score. Note that
documents from different sources are interfiled and ranked
in a single list.

The RP filtering function is accomplished by means of a
multiple regression formula that successfully predicts
cut-off criteria on a ranked list of relevant documents for
individual queries based on the similarity of documents to
queries as indicated by the vector matching (and optionally
the proper noun matching) scores. The RP is sensitive to the
varied distributions of similarity scores (or match scores) for
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different queries, and is able to present to the user a certain
limited percentage of the upper range of scored documents
with a high probability that close to 100% recall will be
achieved. The user is asked for the desired level of recall (up
to 100%), and a confidence interval on the retrieval. While
in some cases a relatively large portion of the retrieved
documents would have to be displayed, in most cases for
100% recall with a 95% confidence interval less than 20%
of the retrieved document collection need be displayed. In
trials of the DR-LINK system (level of recall 100%,
confidence level 95%), the system has collected an average of
97% of all documents judged relevant for a given query
[Liddy94b].

4.5 Clustering

Documents can be clustered using an agglomerative
(hierarchical) algorithm that compares all document vectors
and creates clusters of documents with similarly weighted
vectors. The nearest neighbor/Ward's approach is used to
determine clusters, thus not forcing uniform sized clusters,
and allowing new clusters to emerge when documents
reflecting new subject areas are added. These agglomerative
techniques, or divisive techniques, are appropriate because
they do not require the imposition of a fixed number of
clusters.

Using the clustering algorithm described above, or other
algorithms such as single link or nearest neighbor,
DR-LINK is capable of mining large data sets and extracting
highly relevant documents arranged as conceptually-related
clusters in which documents (possibly from several
languages) co-occur.

Headlines from newspaper articles or titles from
documents in the cluster are used to form labels for clusters.
Headlines or titles are selected from documents that are near
the centroid of a particular cluster, and are therefore highly
representative of the cluster's document contents. An
alternative labeling scheme, selectable by the user, is the use of
the labeled subject codes which make up either the centroid
document's vector or the cluster vector.

The user is able to browse the documents, freely moving
from cluster to cluster with the ability to view the full
documents in addition to their summary representation. The
user is able to indicate those documents deemed most
relevant by highlighting document titles or summaries. If the
user so decides, the relevance feedback steps can be
implemented and an "informed" query can be produced, as
discussed below.

The DR-LINK system is thus able to display a series of
conceptually-related clusters in response to a browsing
query. Each cluster, or a series of clusters, could be used as
a point of departure for further browsing. Documents
indicative of a cluster's thematic and conceptual content would be
used to generate future queries, thereby incorporating
relevance feedback into the browsing process.

FIG. 7 shows a sample result of agglomerative algorithm
based document clustering. Each document is represented as
the headline of the document in the far right column. The
numbers which are placed before the headlines show the
document clustering steps. The first and the second
documents in the far right column are identified as the members
of the first cluster (i.e., cluster A). The third and fourth
documents become the second cluster (i.e., cluster B). Then
the first and the second cluster form the third cluster (i.e.,
cluster C). In the final step, the cluster which contains 8
documents from the top (i.e., cluster G) is combined with
another cluster which contains three documents from the
bottom (i.e., cluster H) to form one final cluster (i.e., cluster
I). It is convenient to consider the representation of the
cluster as a tree.

4.6 Developing "Informed" Queries for Relevance Feedback

Relevance feedback is accomplished by combining the
vectors of user-selected documents or document clusters
with the original query vector to produce a new, "informed"
query vector. The "informed" query vector will be matched
against all document vectors in the corpus or those that have
already passed the cut-off filter. Relevant documents will be
re-ranked and re-clustered.

4.6.1 Combining of Vectors. The vector for the original
query and the vectors of the user-selected documents are
weighted and combined to form a new, single vector for
re-ranking and re-clustering.

4.6.2 Re-Matching and Ranking of Corpus Documents
with New, "Informed" Query Vector. Using the same
similarity measures described above for matcher 55, the
"informed" query vector is compared to the set of vectors of
all documents above the cut-off criterion produced by the
initial query (or for the whole corpus, as desired), then a
revised query-to-document concept similarity score is
produced for each document. These similarity scores are the
system's revised estimation of a document's predicted
relevance. The set of documents are thus re-ranked in order of
decreasing similarity of each document's revised predicted
relevance to the "informed" query on the basis of revised
similarity value.

4.6.3 Cut-Off and Clustering after Relevance Feedback.
Using the same regression formula described above in
connection with the recall predictor, a revised similarity
score cut-off criterion is determined by the system on the
basis of the "informed" query. The regression criteria are the
same as for the original query, except that only the vector
similarity score is considered. The agglomerative
(hierarchical) clustering algorithm is applied to the vectors
of the documents above the revised cut-off criterion and a
re-clustering of the documents will be performed. Given the
re-application of the cut-off criterion, the number of
document vectors being clustered will be reduced, and improved
clustering is achieved.

4.7 Variations on MLM

There are a number of different ways to implement the
MLM functionality. First, while the current implementation
combines the selected (or marked) document representations
with the initial query representation to generate a revised
query representation, it is also possible to base the query
entirely on the document representations, and ignore the
initial query representation. Additionally, while it is possible
to rely on the stored document representations, it may be
more efficient, especially if the user selects only portions of
a document, to reprocess the selected documents to generate
the revised query. In a current implementation, the latter is
done.

The MLM functionality gives rise to an additional way to
use the DR-LINK processing capabilities. A set of
documents need not arise from running a query using the
DR-LINK system. Any collection of relevant documents,
including a single document, could be used to formulate a
query to find additional documents like the collection. These
documents need only be identified to the DR-LINK system
and processed as if they were MLM documents arising from
a query. If the documents were not in the database, their
representations would have to be generated and combined.

Prior art searching is an example of an application where
such a "queryless search" capability could be particularly
useful. The user could be aware of a set of documents, which
had they been published earlier, would be highly relevant
prior art. By identifying such documents, the user could run
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a query whose representation was based on these
documents. By limiting the search to an early enough date range,
the retrieved documents would be highly likely to represent
highly relevant prior art.

5.0 Graphic User Interface Overview System

In general, the graphic user interface (GUI) for the
DR-LINK information retrieval system is a sequence of
related screens, windows and associated graphical
environments that facilitate interactions with users. Specifically, the
GUI allows users to: interact with the system to select data
resources; to create a natural language query; to alter,
expand or otherwise interact with the computer-generated
query representation; to select criteria for retrieving, ranking
and displaying documents; and to re-submit a query based
on the contents of documents considered highly relevant.
The GUI allows the user to interact and influence the various
processing elements of the DR-LINK system described
earlier in this application.

Like the DR-LINK system, the GUI can exist in a variety
of computing environments using a variety of software,
hardware and operating systems. The specific instantiation
discussed in this application is for the Microsoft Windows
operating environment, from the Microsoft Corporation,
Redmond, Seattle, Wash. Other instantiations for the GUI
include an online, world-wide web-based system using the
Netscape browsing tool, available from Netscape
Corporation, Mountain View, Calif. Other versions of the
GUI client system are possible for other computing
environments. The general features and methods used with the
GUI and discussed in this application are independent of the
computing environment.

5.1 Typical Screen

FIG. 8 shows a typical GUI screen 280. All GUI screens
share common features and elements, arranged in a
consistent manner for easy navigation. Processing is activated by
positioning an on-screen cursor using a pointing device, and
using associated buttons to select items, pull-down menus,
or position a text cursor for inputting characters. These
common on-screen elements include: A menu bar 280a; a
navigational toolbar 280b, consisting of a series of buttons
which each initiate various DR-LINK features, subroutines
or actions; an options toolbar 280c, allowing the user to
specify processing attributes; various on-screen windows
280d, in which users can type free-form text and interact
with the computer system; and various pop-up dialog boxes
280e, which include instructions for typing text in boxes
280f, with related pop-up window buttons 280g. At the
bottom of the screen is a status bar 280h. In addition, users
are able to select one or several items by clicking on
selection boxes (e.g., see FIG. 12), or by selecting a check

of documents deemed highly relevant, and printing or
storing marked documents). The actual process of text retrieval
is fully interactive and recursive: Users are able to navigate
through the system at will using any combination of the
steps outlined below. Not all steps are required, nor is the
specific sequence of steps required.

Prior to initiating query processing, the user selects
databases (300a; see FIG. 12), selects a date range (300b; see
FIG. 13), selects a number of documents to retrieve (300c),
composes a natural language query (300d), invokes spelling
and grammar checking (300e), and initiates query
processing (300f).

As described in detail above, the query processor
generates a number of representations of the query. As will be
described below, manifestations of these representations are
displayed (see FIGS. 14A and 14B), and the user is given the
opportunity to determine whether the system's analysis of
the query is satisfactory or needs modification. These
representations include proper noun (300g), complex nominal
(300h), SFC (300i), meta-phrase (300j), time frame (300k),
single term (300l). The system also provides the system's
interpretation of which terms in the query are deemed to be
mandatory (300m), and solicits user input. Once the user has
modified the system's interpretation of the query, the user
invokes the matcher (300n) which executes the query
against the database.

Once the documents have been retrieved and placed in
folders, the user is given an opportunity to modify the
retrieval/foldering criteria (300o) and document display
criteria (300p). The user may then select documents (300q)
for printing or downloading (300r), or for the purpose of
refining the query (300s; see FIG. 19). If the user has
marked documents deemed by the user to be particularly
relevant, the user can invoke the more-like-marked feature
(300t), which causes the query representation to be modified
in view of the documents and the refined query to be rerun.
The user is, at any time, free to initiate a new search request
(300u), or exit the system (300v).

6.0 User Interaction With the System Before Query is
Processed

6.1 Sign On

FIG. 10 shows the initial screen 330 that appears when a
user selects the DR-LINK software program for operation.
The initial screen prompts the user to sign-on using a pop-up
dialog box 330a. The user is requested to provide a
registered username in a field 330b and unique password code in
a field 330c. Only users with registered usernames and valid
passwords are allowed to proceed. Once a valid and correct
username and password have been entered, the user selects
the "Sign On" button 330d to enter the system. If the user

button (e.g., FIG. 12). Together, these items allow users to fails to select a valid username or password, the system will 

interact with and navigate through the information retrieval prompt for corrected identification. The "Set Up" button 

system. 330e allows the user to conFig. the nature and type of 

5.2 Sequence Overview modem-based communications between the host computer 
FIGS. 9A and 9B, taken together, provide a flowchart 55 and the remote DR-LINK client computer system, which 

showing a preferred sequence 300 of GUI-based interactions comprises the DR-UNK system outlined earlier in this 

between the DR-LINK system and the user. A total of 22 application. 

specific interactions are shown and many will be described 6.2 Select Databases 

with reference to particular display screens. A number of the FIG. 11 shows a query screen 340 which appears once the 

interactions occur before the query is processed (including 60 user is signed-on to the DR-LINK system. Among the 

login, data selection, and query construction), a number elements of the query screen are a sequence of navigational 

occur after initial query processing but before query execu- toolbar items 340a, 3406, 340c, and 340tf, a sequence of 

tion (query review and feedback), and a number occur after option toolbar items 340e, an instructional window 340/ 

the documents are retrieved (including retrieval and display with hyperlinked online help, a query window 340g for 

criteria selection, the display of relevant documents in 65 entering a free-form query statement, and a status bar 340/z. 

various formats, the marking of relevant documents, the Users of the system can select a range of data sources by 

construction of new, informed queries based on the contents activating the "Select Database" option toolbar button. 
FIG. 12 shows the "Select Database" screen 350 with
pop-up dialog box 350a. The default setting for database
selection is "Search all Databases" 350b; users are able to
customize the default as required. Alternatively, users can
specify which databases are to be searched for a given
query by selecting classes of publications 350c or
individual publications 350d within a class using
selection boxes. Databases are arranged as clusters of
related source files (see later description of the
arrangement of categories, databases and sources). Brief
descriptions of selections are shown on selection 350e;
this description can be amplified by selecting the
"Extended Description" button 350f. Selecting the "OK"
button 350g returns the user to query screen 340.

6.3 Select Date Range

FIG. 13 shows a date selection screen 360. From query
screen 340 (FIG. 11), selecting "Select Date Range" 340/
from the options toolbar activates the "Date Range
Selection" pop-up window 360a. The default for date range
selection is "All Dates Selected" 360b, although the
default can be changed by the user. The selection of dates
can be over a range using either exact dates 360c and
360d, or by selecting pre-determined ranges using radio
buttons 360e and 360f. Activating the "OK" button returns
the user to query screen 340 (FIG. 11). Dates are computed
using the document date field identified by the DR-LINK
system in initial document preprocessing.

6.4 Select Number of Documents (Preference), Scope,
and Dialogs

Several other retrieval criteria can be selected by the
user using the "Select Additional Options" button from the
option toolbar. An options pop-up window appears with
three folders.

In the Preferences folder the user can select the number of
retrieved documents to be returned based on any one or
some combination of the following: the total number of
documents to be retrieved; the total number of documents to
be placed in any of the three folders described earlier; or
the required effective level of recall, using a novel
recall predictor (RP) function.

The RP filtering function is accomplished by means of a
multiple regression formula that successfully predicts a
ranked-list or cut-off criterion for individual queries
based on the similarity of documents to queries as
indicated by the DR-LINK matching scores. The RP is
sensitive to the distribution of match scores for a given
query. Users are asked to state a desired level of recall
and a confidence level for that level of recall. Using a
regression formula, the RP system is able to compute a
cut-off point based on a lower-bound match score that, to
the stipulated confidence level, will include relevant
documents to the stipulated recall level. While in some
cases a relatively large portion of the retrieved
documents would have to be displayed, in most cases for
100% recall with a 95% confidence interval less than 20%
of the retrieved document collection need be displayed. In
trials (level of recall 100%, confidence level 95%) the
system has collected an average of 97% of all documents
judged relevant for a given query [Liddy94b].

The Scope folder allows users to specify which databases
will be searched for relevant documents. Documents are
arranged in a three-tier file system. At the upper level
of the file system are categories (e.g., "Software Prior
Art"); within categories are databases (e.g., IEEE
publications); within databases are individual sources
(e.g., the IEEE publication "Applied Astrophysics").

The Dialogs folder allows users to select the More Like
Marked (MLM) relevance feedback feature described later.

6.5 Write Natural Language Query

As stated earlier, DR-LINK makes no requirement that
the user state a query using prescribed annotations or
logical formulations. Instead the query can be stated in
fully-formed natural sentences and the DR-LINK processing
modules automatically create various representations of
the query used for document retrieval. Query statements
are entered in query window 340/ of query screen 340
(FIG. 11). Queries can be of any length and of any
complexity. Mandatory requirements can be stated using
common phraseology (e.g., "All documents must mention ..."
or "I am only interested in information that
specifies ...").

The system is also sensitive to statements of negation,
which can also be entered using natural language (e.g., "I
am not interested in ..." or "Documents discussing X are
not useful"). Proper nouns can be entered in variant forms
(e.g., "Lincoln," "President Lincoln" or "President
Abraham Lincoln") and clarifications or expansions will
automatically be made. Complex nominals (CNs), noun
phrases and other related parts of speech will be
recognized, and variant synonymous expressions will
automatically be generated.

The subject-content of queries is also captured at the
conceptual level by SFC 250. The temporal nature of the
query (past, present, future or some combination thereof)
is also captured using Meta-Phrase Identifier 180. The
same module also identifies the underlying intention of
the query (a request for analytic information, evaluation,
cause/effect, etc.) and this is used for matching
purposes. This and other processing is performed
automatically by the query processor (QP), described in
detail above.

6.6 Spell/Grammar Checking

The words in the user's query are checked using a
commercial off the shelf (COTS) spell checking and grammar
checking system. The user is prompted when unidentified
words are used in the query, and shown possible correct
spellings. A similar technique is used for grammar
checking.

7.0 User Interaction After Query Processing But Before
Query Execution

7.1 Review of Your Request

FIGS. 14A and 14B show the "Review of Your Request"
screen 370. The specific annotations and representations
of the natural language query statement mentioned above
are produced by the QP, and once completed, the annotated
results of the QP are displayed on this screen. The
"Review of Your Request" screen encompasses many of the
query processing (QP) interactions of items 300g through
300/ in FIG. 9A.

FIG. 14A shows the full "Review of Your Request" screen
and contains: a full statement of the user's query 370a; a
representation 370b of the identified proper nouns (PNs),
together with related clarifications or expansions; a
representation 370c of complex nominals (CNs), with
appropriate expansions; a ranked listing 370d of
identified subject field codes (SFCs) for the query; a
listing of all terms (PNs, CNs, and single terms)
identified in the query by the QP, marked as either
mandatory 370e or non-mandatory 370f; and a statement 370g
of the meta-phrase requirements and temporal aspects.

FIG. 14B shows a detail 380 of FIG. 14A, specifically the
term-based expansions and clarifications.

The following discussion takes each element of the query
representation shown in the "Review of Your Request"
screen, and describes that representation and the user's
ability to manipulate the system's initial understanding
and representation of the query.
7.2 Proper Noun Representation

Using the functional capabilities of PNC 170, the system
displays to the user all identified proper nouns
(including group proper nouns and group common nouns). The
system also generates standard and variant forms of the PN
using heuristics and databases (e.g., "IBM" is recognized
as having various forms, including "International Business
Machines Corp.," "International Business Machines Inc.,"
etc.). In the special case of group proper nouns, the
group is expanded to include all member proper nouns.

The standard form of the proper noun is used as a root,
with variant forms indicated as branches from the root
(see FIG. 14B). For example, the reference to "IBM" in the
query statement has the standard form "International
Business Machines Corp.," with variant forms such as
"International Business Machines Inc." branching off.
Referring to FIG. 14B, users are able to select which
clarifications and expansions of the PN are appropriate
using selection windows. The standard form 380a of the PN
can be selected along with all variants 380b, or with some
combinations of these forms, by marking appropriate
selection boxes. Terms are indicated as selected by an "X"
marker in the appropriate box. The user is able to scroll
through all PN representations using the scroll bar 380c.

7.3 Complex Nominal Representation

Complex nominals (adjacent noun pairs/sequences of
non-predicating adjectives and nouns) are detected by the
CN detector 190. Variant synonymous CNs are automatically
created by a process of recombination (whereby a CN such
as "information retrieval software" might produce
"information software") or expansion (whereby the CN 380d
in FIG. 14B, "new products," is expanded to the synonymous
[illegible]). [illegible] alternates as the root phrase or
term 380a, followed by possible recombinations and
expansions 380e. The user is able to select which CNs and
expansions are appropriate by placing an "X" marker in the
appropriate box.

This feature may be added to single term expansion, and
other methods, such as statistical thesaurus building.

7.4 Meta-Phrase Identification

Meta-phrase identification (MDI) is the representation of
several high-level dimensions of meaning or intention in a
query statement (the analogous process in document
processing is text structure). This processing (by
meta-phrase identifier 180) is based on discourse theory
and labels the discourse component requirements of a query
using a suite of tags. "Review of Your Request" screen 370
displays all the possible meta-phrase tags assigned to the
query representation under the heading "Request
Preferences". This labeling is not exclusive: any
combination of tags is possible and tag assignments can be
changed or added to by the user. The tags shown here are a
subset of possible discourse-level requirement tags.

7.5 Time Frame Representation

The meta-phrase identifier also identifies the temporal
requirements of the query statement, which are displayed
under the heading "Time Frame". The temporal sense of the
query is determined using a range of processing tools
discussed earlier in this application. Several tags may be
assigned to the query. The user is able to alter the
selection using the appropriate selection boxes.

7.6 Single Term Representation

Single terms 370e, 370f are recognized by the DR-LINK
system and displayed in the mandatory terms window, along
with all PNs and CNs. Users are free to add additional
terms as appropriate using the add terms window and "Add"
button. For a discussion of mandatory and non-mandatory
terms, see the discussion below.

7.7 Subject Field Code (SFC) Representation

SFC 250 generates a concept-level description of the
query's contents. Any of a plurality of subject field
codes are assigned to the query statement, based on the
disambiguated codes assigned to each substantive word or
phrase in the query. Codes are also assigned weights
dependent on the relative prevalence of a code in the
query. All codes that relate to the contents of the query
are displayed in window under the heading "[illegible] of
Your Request," with codes ordered according to the weight
assignments. [illegible] user can add terms to the SFC
input screen, and view expansion. In an alternative
embodiment, the user can also [illegible] the relative
weights of the codes.

7.8 Mandatory Term Selection

As discussed in earlier sections of this application, the
DR-LINK system is able to distinguish those aspects of a
query that are considered mandatory for retrieval, and
divides the query representation into mandatory and
non-mandatory terms. Relevance is partly determined for
documents using the aspects of the query. Subsequent
foldering of the documents, and their relative ranking for
retrieval, is based in part on the assignment of mandatory
tags to terms.

All terms (PNs, CNs and single terms) from the query are
displayed in the window labeled "Select the terms that
MUST occur ..." in the order in which they appear in the
original query text. Terms that the DR-LINK system has
determined are mandatory are automatically pre-assigned a
tag, indicated by an X in the selection box next
[illegible] mandatory assignment for any or all terms. New
terms can also be added to the query representation by
using the add terms window and "Add" button. [illegible]
be PNs, CNs or single terms.

8.0 Managing and Interacting with the Retrieved Documents

8.1 Matcher

The user, having reviewed the QP's analysis, and provided
input as described above, can continue the search by
clicking on the "Continue Search" button 370h. The user
can also click the "Return to Request" button 370i and
modify the query. Matcher 55 takes the QP-based query
representation, either unmodified or modified by the user
as described above, and finds suitably similar documents
in a range of databases. The matching process involves
finding similarities or analogues in documents based on
morphological, lexical, syntactic, semantic, discourse,
and pragmatic level features. The QP produces several
variant representations of the query, using logical
structures, SFC-based representation, and other
representations of the query contents. Matching with
documents takes into account the similarity of documents
to a query at the full-document level and within
subdivisions of the document, called logical paragraphs.
Documents are represented in index file 75 with the
representation being largely similar to the representation
of the query produced by the DR-LINK QP module. Thus each
document index file has a SFC vector representation, like
a query, along with a representation based on terms and
term expansions, and the presence within the document of
various other features and attributes at various levels
(discourse, conceptual, lexical, etc.), as described
herein. In normal operation the document index file will
have been created prior to the creation of a query. In a
current awareness updating application the processing of
the documents is done on the fly and the process of
processing and reviewing queries is done in advance and
the query representation stored.
The output of the matcher is a ranked list of documents,
later assigned to folders. The inclusion of a document within 
a folder is based on various logical requirements (e.g., the 
presence or absence of mandated terms), and the rank 
position of a document within each folder is determined by
a similarity score computed in the DR-LINK matcher. 
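
The two-level matching described above (full document plus logical paragraphs) can be illustrated with a minimal sketch. The source does not give the similarity formula, so cosine similarity over SFC weight vectors, the equal weighting of the two levels, and the names `cosine` and `match_score` are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(code, 0.0) for code, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_score(query_sfc, doc_sfc, paragraph_sfcs, doc_weight=0.5):
    """Blend whole-document similarity with the best logical paragraph.

    The linear blend and the doc_weight parameter are hypothetical;
    they are not taken from the DR-LINK description.
    """
    doc_sim = cosine(query_sfc, doc_sfc)
    best_paragraph = max(
        (cosine(query_sfc, p) for p in paragraph_sfcs), default=0.0
    )
    return doc_weight * doc_sim + (1.0 - doc_weight) * best_paragraph
```

A document whose overall subject matter differs from the query can still score above zero here if one logical paragraph matches closely, which is the motivation for scoring at both levels.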

8.2 Retrieval/Foldering Criteria 

FIG. 15 shows a retrieved documents screen 380. In the 
initial case documents deemed sufficiently relevant to the 
requirements stated in a query 380a are placed in one of
three folders 380b, 380c, and 380d. In this preferred
embodiment, the location of a document in a specific folder 
is based on the presence or absence in the document of query 
terms, negation, and mandated terms, as discussed in detail 
above. For example, one set of criteria for a document to be
placed in Folder One is that all query terms (complex 
nominals or expansions, proper nouns or expansions, and 
single terms) match, and no negated terms are present. 
Foldering is performed automatically, based on default or 
user-selected criteria, discussed in detail below. The rank
position of a document within a folder is also computed 
automatically, using similarity scores from matcher 55 for 
the whole document and for logical paragraphs. 
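
The foldering logic can be sketched as follows. Only the Folder One rule (all query terms match and no negated terms are present) is stated in this section; the rules used below for Folders Two and Three, and the names `assign_folder` and `folder_documents`, are assumptions chosen to make the example complete.

```python
def assign_folder(doc_terms, query_terms, mandatory_terms, negated_terms):
    """Return 1, 2 or 3 for a document, given the terms it contains."""
    doc_terms = set(doc_terms)
    has_negated = bool(doc_terms & set(negated_terms))
    # Folder One: rule stated in the text.
    if not has_negated and set(query_terms) <= doc_terms:
        return 1
    # Folder Two: assumed rule (all mandatory terms, no negated terms).
    if not has_negated and set(mandatory_terms) <= doc_terms:
        return 2
    # Folder Three: assumed catch-all for the remaining documents.
    return 3


def folder_documents(docs, query_terms, mandatory_terms, negated_terms):
    """Group documents into three folders, each ranked by match score."""
    folders = {1: [], 2: [], 3: []}
    for doc in docs:
        number = assign_folder(
            doc["terms"], query_terms, mandatory_terms, negated_terms
        )
        folders[number].append(doc)
    for members in folders.values():
        members.sort(key=lambda d: d["score"], reverse=True)
    return folders
```

Folder membership is decided by logical tests on terms, while the position inside a folder depends only on the similarity score, mirroring the separation described above.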

There are three folders in the preferred embodiment 
discussed here. Other arrangements of folders and variant
criteria for matching are possible. Some variations are 
discussed in this application. The full query is restated at the 
top of the screen, with the three folders indicated by tabs and 
stars. The total number of documents in each folder is stated 
on each folder tab (e.g., in FIG. 15 a total of 31 documents
are assigned to Folder One). Documents are shown in 
citation form 380e, with overall rank position, source, date, 
headline/title, author and number of pages indicated.
Documents can be selected by marking the appropriate selection
box 380f. Other document representations are possible, and
are discussed later in the application. 

In the general case, retrieved documents can be displayed 
using two criteria: Foldering and Sorting. Foldering is the 
process whereby documents are arranged in discrete groups 
according to user-defined criteria. This is the top-level
mechanism for arranging retrieved documents. Any of the 
evidence sources used for document indexing can be used 
alone or in any combination as criteria for foldering. For 
example, folders can be created according to subject field 
codes if the user clicks the "View by Subject" button 380g,
by the presence of various PNs or CNs (e.g., a query 
requesting information about American political leaders 
might folder by Bob Dole, Al Gore, Bill Clinton, etc.), by 
source (e.g., New York Times, The Economist, etc.), or by 
Text Structure. Other foldering criteria are possible.
Foldering criteria can be initiated by the user by selecting the
"View Folders" menu bar item 380h, and then for the
specific case of foldering by SFC, by using the "View by
Subject" button on the navigation toolbar.

"Sorting" is the process whereby documents assigned to
folders are arranged within the folder. Again, any criterion 
that is represented in the document index file or is created by 
the DR-LINK system in response to a query can be used for 
sorting documents (e.g., document date, match score, etc.), 
by having the user select the appropriate item in the "Sort
Folders" menu bar item 380i. 
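
Sorting within a folder on one or more criteria can be sketched as below. The field names (`date`, `score`) and the function `sort_folder` are hypothetical; the source only says that any criterion held in the document index file or produced in response to a query may be used.

```python
from datetime import date


def sort_folder(documents, criteria):
    """Sort documents by several criteria, most significant first.

    criteria is a list of (field, descending) pairs. Python's sort is
    stable, so sorting by the least significant key first and the most
    significant key last yields the combined ordering.
    """
    ordered = list(documents)
    for field, descending in reversed(criteria):
        ordered.sort(key=lambda d: d[field], reverse=descending)
    return ordered


example = [
    {"id": "a", "date": date(1995, 3, 1), "score": 0.71},
    {"id": "b", "date": date(1995, 3, 1), "score": 0.93},
    {"id": "c", "date": date(1994, 11, 20), "score": 0.88},
]
# Newest first, ties broken by higher match score.
by_date_then_score = sort_folder(example, [("date", True), ("score", True)])
```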

FIG. 16 shows the screen 390 for foldering according to 
the "View by Subject" criteria. Folders are created using
subject field code (SFC) categories. In this case, in the 
preferred embodiment, for all retrieved documents subject
field codes are ranked according to their relative strength in 
the SFC vector. The top three ranked subject field codes for 
each document are used to determine the most prevalent
SFCs across all retrieved documents, and folders are created 
in rank order accordingly. Documents are assigned to a 
SFC-based folder according to the relative strength of that 
code in the document's overall SFC vector, using predeter- 
mined or user-selected cut-off criteria. Documents can thus 
appear in a plurality of folders. Sorting within the folder can 
be according to any evidence source found in the index file 
representation of the document. For example, the sorting of 
documents within a folder can be according to the strength 
of a subject field code, date of publication, original rank 
position by matching score, the absence of negation, or any 
combination of these attributes. 
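
The "View by Subject" foldering procedure can be sketched under a few stated assumptions: prevalence is measured as the number of documents that list a code among their top three, relative strength is a code's weight divided by the sum of the document's SFC weights, and the default cut-off of 0.2 is arbitrary. The function name `sfc_folders` and the document layout are hypothetical.

```python
from collections import Counter


def sfc_folders(docs, top_n=3, cutoff=0.2, max_folders=None):
    """Build SFC-based folders; a document may land in several folders."""
    # Step 1: count how often each code is among a document's top codes.
    prevalence = Counter()
    for doc in docs:
        ranked = sorted(doc["sfc"].items(), key=lambda kv: kv[1], reverse=True)
        prevalence.update(code for code, _ in ranked[:top_n])

    # Step 2: create folders in rank order of prevalence.
    folder_codes = [code for code, _ in prevalence.most_common(max_folders)]

    # Step 3: assign documents whose relative strength meets the cut-off.
    folders = {}
    for code in folder_codes:
        members = []
        for doc in docs:
            total = sum(doc["sfc"].values())
            strength = doc["sfc"].get(code, 0.0) / total if total else 0.0
            if strength >= cutoff:
                members.append((strength, doc["id"]))
        members.sort(key=lambda m: m[0], reverse=True)
        folders[code] = [doc_id for _, doc_id in members]
    return folders


example_docs = [
    {"id": "d1", "sfc": {"LAW": 0.6, "POLITICS": 0.4}},
    {"id": "d2", "sfc": {"LAW": 0.5, "ECONOMICS": 0.5}},
    {"id": "d3", "sfc": {"POLITICS": 0.9, "LAW": 0.1}},
]
subject_folders = sfc_folders(example_docs)
```

In this example document d1 appears in both the LAW and POLITICS folders, while d3 is excluded from LAW because that code falls below the cut-off in its vector.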

8.3 Document Display Criteria 

As discussed above, retrieved documents can be viewed 
in several different forms by the user. In the initial case, 
documents are displayed in folders in "short form" (see 
FIGS. 15 and 16). Elements of this representation are: Rank 
position by relevance score, beginning with the assignment
"1" for the first document in the first folder; the source of the
document; the author or authors; the headline or other 
summary text of the contents; the original date of publica- 
tion; and the length or size of the document. Documents can 
be selected using the appropriate selection box for viewing
in another format (e.g., full text, see below); More Like 
Marked (MLM) relevance feedback (see discussion below); 
or for printing or downloading. 
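
The "short form" elements listed above map naturally onto a small record type. The following is a sketch only: the class name `ShortFormCitation`, its field names, and the one-line layout produced by `render` are assumptions, not the format shown in FIGS. 15 and 16.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class ShortFormCitation:
    rank: int            # rank position by relevance score, starting at 1
    source: str          # publication the document came from
    authors: List[str]   # author or authors
    headline: str        # headline or other summary text
    published: date      # original date of publication
    pages: int           # length or size of the document

    def render(self) -> str:
        """Return a one-line citation (hypothetical layout)."""
        authors = ", ".join(self.authors) if self.authors else "(no author)"
        return (
            f"{self.rank}. {self.headline} | {authors} | {self.source} | "
            f"{self.published.isoformat()} | {self.pages} pp."
        )
```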

FIG. 17 shows a screen 400 with articles displayed in 
"summary form." Elements of this representation are:
Heading 400a, taken from the headline or other summary
introduction to the document; a date field 400b, showing the
original date of publication of the document; a source field
400c; the lead or opening paragraph of the document 400d;
the most relevant paragraph or section of the document 
400e; a breakdown of the proper nouns represented in the 
document by categories, these categories to include people, 
countries, nationalities, companies, etc.; a list of complex 
nominal and noun phrases that appear in the document, 
useful for modifying or fine-tuning a new query statement; 
and a list of subject field codes, indicating what the general 
subject matter of the document is. Users are able to navigate 
through alternate document representations, or different 
documents, using the buttons 400f at the bottom of the 
window. 

FIG. 18 shows a screen 410 with articles displayed as full 
text. This full-text representation includes a formatted ver- 
sion of the unedited original text from the source document. 
Elements of the full text representation are: "Headline" field 
410a which shows an actual headline or other summary text 
lead to the document; an author field 410b; a date field
corresponding to the original date of publication 410c; a
source field 410d; an informational field 410e which
describes additional information about the document, such
as copyright restrictions; a document number ("DOC#") and
document identification ("DOC ID") field 410f displayed in
the main text field, which is an internal reference system for 
DR-LINK which uniquely identifies each document in the 
corpus; a display of the full text of the document 410g; and 
a series of buttons 410h by which the user can navigate 
through the system. 

8.4 Selection of Documents/Printing and Saving 
Documents can be selected for downloading or printing at 

the user's client computer system by marking the document. 
Documents can be marked in the "short form" representation 
shown in FIG. 16 by placing an "X" in the appropriate 
selection box 390a, then selecting "Print" 390b or "Save"
390c from the navigation toolbar. Documents can be printed 
to a computer storage device as a digital file, or to a
printer as hard copy. Formatting options are available for
different computer systems and different printer types. A
unique print option allows users to concatenate a sequence
of documents or discrete texts (document summaries, etc.)
in a single print file, even if the computing environment
does not generally support such an option.

8.5 Refine Query and More Like Marked (MLM)

FIG. 19 shows a screen 420, showing the use of the More
Like Marked (MLM) function in the user interface. This
feature invokes the DR-LINK relevance feedback system,
whereby the contents of marked documents (or portions of
documents) considered especially relevant by the user can
be used to help formulate a new, revised query statement
for document retrieval. The MLM retrieval process is
similar to the retrieval mechanism described for initial
query representation and matching. The revised query is
represented by the sum contents of all MLM-selected
documents, plus the original query representation, using
the QP described in this application.

Referring to FIG. 19, documents are selected for the
revised, MLM query statement by selecting documents in
short form representation by marking the appropriate
selection box 420a, then selecting the "More Like Marked"
tool 420b from the navigation toolbar. Selecting the MLM
function from the navigation toolbar instructs the DR-LINK
system to reformulate a new query representation based on
the subject-contents of the marked documents, along with
the original query. With the revised query the user may be
asked to confirm the query representation, as was the case
with the original query, dependent on user-selected
preference settings.

FIG. 20 shows a screen 430 that is presented once the
DR-LINK system has retrieved documents judged to be
relevant to the revised query. All the documents are placed
in a single file 430a marked "More Like Marked." This
screen shows the original query statement 430b, along with
the retrieved documents in short form representation,
ranked according to their relevance score 430c. The system
will display the same number of documents that was chosen
for the original query. Documents in the MLM folder can be
viewed according to any of the display, foldering and
sorting criteria discussed above.

8.6 Variations on MLM

There are a number of different ways to implement the
MLM functionality. First, while the current implementation
combines the selected (or marked) document representations
with the initial query representation to generate a revised
query representation, it is also possible to base the query
entirely on the document representations, and ignore the
initial query representation. Additionally, while it is
possible to rely on the stored document representations, it
may be more efficient, especially if the user selects only
portions of a document, to reprocess the selected documents
to generate the revised query. In a current implementation,
the latter is done.

The MLM functionality gives rise to an additional way to
use the DR-LINK processing capabilities. A set of
documents need not arise from running a query using the
DR-LINK system. Any collection of relevant documents,
including a single document, could be used to formulate a
query to find additional documents like the collection.
These documents need only be identified to the DR-LINK
system and processed as if they were MLM documents arising
from a query. If the documents were not in the database,
their representations would have to be generated and
combined.

Prior art searching is an example of an application where
such a "queryless search" capability could be particularly
useful. The user could be aware of a set of documents,
which had they been published earlier, would be highly
relevant prior art. By identifying such documents, the user
could run a query whose representation was based on these
documents. By limiting the search to an early enough date
range, the retrieved documents would be highly likely to
represent highly relevant prior art.

8.7 New Request

At any time in the retrieval process the user is able to
generate a new query statement by selecting "New Request"
from the navigation toolbar. This takes the user back to
#1 in FIG. 9. It is also possible to call up old, saved
queries and rerun them.

8.8 End

Users can exit the DR-LINK system by selecting from the
menu bar "File," and then from the pull-down menu, "Exit."

9.0 References

[Liddy93] Liddy, E. D., Paik, W., Yu, E. S. & McVearry, K.
An overview of DR-LINK and its approach to document
filtering. Proceedings of the ARPA Workshop on Human
Language Technology. Publication date: 1993.

[Liddy94a] Liddy, E. D. & Myaeng, S. H. (1994). DR-LINK
System: Phase I Summary. Proceedings of the TIPSTER
Phase I Final Report.

[Liddy94b] Liddy, E. D., Paik, W., Yu, E. S. & McKenna, M.
(1994). Document retrieval using linguistic knowledge.
Proceedings of RIAO '94 Conference.

[Liddy94c] Liddy, E. D., Paik, W., Yu, E. S. Text
categorization for multiple users based on semantic
information from an MRD. ACM Transactions on Information
Systems. Publication date: 1994. Presentation date: July,
1994.

[Liddy95] Liddy, E. D., Paik, W., McKenna, M. & Yu, E. S.
(1995) A natural language text retrieval system with
relevance feedback. Proceedings of the 16th National
Online Meeting.

[Gentner81] Gentner, David. (1981) Some interesting
differences between verbs and nouns. Cognition and brain
theory 4(2), 161-178.

[Hanson90] Hanson, Stephen Jose. (1990) Conceptual
clustering and categorization: bridging the gap between
induction and causal models. In Yves Kodratoff &
Ryszard Michalski (eds.) Machine Learning, Volume III.
Morgan Kaufmann Publishers: San Mateo, Calif.

[Paik93a] Paik, W., Liddy, E. D., Yu, E. S. & McKenna, M.
Categorizing and standardizing proper nouns for efficient
information retrieval. Proceedings of the ACL Workshop
on Acquisition of Lexical Knowledge from Text.
Publication date: 1993.

[Paik93b] Paik, W., Liddy, E. D., Yu, E. S. & McKenna, M.
Interpretation of Proper Nouns for Information Retrieval.
Proceedings of the ARPA Workshop on Human Language
Technology. Publication date: 1993.

[Salton89] Salton, Gerald. (1989) Automatic Text
Processing. Addison-Wesley Publishing: Reading, Mass.

[VanDijk88] VanDijk, Teun A. (1988) News Analysis.
Lawrence Erlbaum Associates: Hillsdale, NJ.

10.0 Conclusion

In conclusion, the present invention provides a robust and
efficient method for implementing an information retrieval
system that offers users the opportunity to fully interact
with the retrieval process. Specifically, the retrieval
system uses natural language processing (NLP) techniques to
represent, index, and retrieve texts at the multiple levels
(e.g., the morphological, lexical, syntactic, semantic,
discourse, and pragmatic levels) at which humans construe
meaning in writing.
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Using a graphic user interface (GUI), the retrieval system 
interacts with the user to formulate a complex representation . 
of the subject-contents of a query statement expressed in 
fully-formed sentences. Users can state queries as natural 
text of any length or complexity, as if they were expressing 5 
an information need to an expert in the field. The retrieval 
system automatically generates alternative representations 
of the subject-contents of the query, presenting these repre- 
sentations to the user for modification as required. The 
interaction of the user with the underlying query processing 
modules of the retrieval system allows users to state their 1 
information needs in a complex, precise form. 

The described retrieval system also allows the user to 
interact with the retrieval matching engine through a 
flexible, sophisticated system of foldering and sorting. The 
matching of documents to a query is based on a number of
evidence sources. This retrieval system allows users to state 
multiple criteria for retrieving documents and for arranging 
those retrieved documents in rank order within related 
clusters or folders. 

Users are also able to restate queries using relevance
feedback techniques. In the initial retrieval process, the
documents deemed highly relevant can be used to reformulate
a new, revised query. The subject-contents of marked
documents are used to generate a new query representation.
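
The relevance feedback step summarized above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each alternative representation is reduced to a mapping from terms to weights and that representations are combined by averaging; the function name build_mlm_query and the averaging rule are hypothetical and are not taken from the described implementation. Passing the initial query corresponds to combining it with the marked documents, and omitting it corresponds to basing the new query entirely on the marked documents.

```python
from collections import defaultdict
from typing import Dict, List, Optional

Representation = Dict[str, float]  # assumed form: term -> weight


def build_mlm_query(
    marked_docs: List[Representation],
    initial_query: Optional[Representation] = None,
) -> Representation:
    """Average the marked document representations into a new query.

    If initial_query is given it is averaged in as one more source;
    otherwise the new query depends on the marked documents only.
    """
    sources = list(marked_docs)
    if initial_query is not None:
        sources.append(initial_query)
    if not sources:
        raise ValueError("at least one representation is required")

    totals: Dict[str, float] = defaultdict(float)
    for representation in sources:
        for term, weight in representation.items():
            totals[term] += weight
    return {term: weight / len(sources) for term, weight in totals.items()}


marked = [{"bank": 1.0, "merger": 1.0}, {"bank": 1.0}]
query = {"merger": 1.0, "antitrust": 1.0}

docs_only = build_mlm_query(marked)          # ignores the initial query
combined = build_mlm_query(marked, query)    # includes the initial query
```

In this sketch a term such as "antitrust" survives into the revised query only when the initial query is included, which is the practical difference between the two variants.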

While the above is a complete description of specific
embodiments of the invention, various modifications, 
alterations, alternative constructions, and equivalents can be 
used. For example, the described invention is not restricted 
to operation within certain specified computer 
environments, but is free to operate within a plurality of
computer environments. While the preferred embodiment 
employs a specified range of interactions with the user 
through the GUI, the sequence and number of these inter- 
actions is not essential for operation. 

The evidence sources used to create representations of
texts (documents or queries) are described in specific detail in
this application. The general method of interaction and
retrieval is not dependent on all sources of evidence being
used, or restricted to only those sources described. While a
specific series of GUI screen illustrations is used in this
application, the method of interaction between the user and
the underlying retrieval system is not dependent on the
specific arrangement of elements in each screen and
alternative arrangements are possible.

Therefore, the above description should not be taken as
limiting the scope of the invention as defined by the 
appended claims. 

What is claimed: 

1. A method of operating a computerized information 
retrieval system where information is retrieved from a
database containing documents in response to user queries, 
the method comprising: 

receiving a natural language query specifying information 
to be retrieved;

processing the query to generate an alternative
representation of the query, the alternative representation
including generating a logical representation
incorporating terms and phrases found in the query and
assigning weighted Boolean scores to the query terms;

displaying query information to the user indicating the 
result of said step of processing the query; 

receiving user input responsive to the display of query 
information; 

in response to such user input, if any, modifying the
alternative representation of the query to reflect such
user input; and

matching the alternative representation of the query,
modified in accordance with such user input, if any,
against the database.

2. The method of claim 1 wherein: 

the alternative representation of the query includes a
conceptual level defined by a subset of a finite set of
subject codes, the subset corresponding to subject
codes interpreted by the system to characterize the
subject contents of the query.

3. The method of claim 1 wherein:

the user input specifies increasing or decreasing the set of 
terms and phrases. 

4. The method of claim 1 wherein: 

the alternative representation of the query includes a set of 
terms and phrases found in the query that are inter- 
preted by the system to be mandatory requirements for 
relevance; and 

the user input specifies increasing or decreasing the set of 
terms and phrases. 

5. The method of claim 1 wherein: 

the database contains documents that have been processed 
to generate corresponding alternative representations of 
the documents. 

6. The method of claim 5, and further comprising the 
steps, performed after said step of matching the alternative 
representation of the query, of: 

for at least some documents in the database, generating a 
measure of relevance of the document to the query 
using the alternative representation of the document 
and the alternative representation of the query;
and 

displaying a list of documents whose measures of rel- 
evance are deemed sufficiently high. 

7. The method of claim 6, and further comprising the 
steps, performed after said step of displaying a list of 
documents, of: 

receiving user input specifying selection of at least some 
of the list of documents; 

using the alternative representations of the documents, so 
selected, to generate a new query representation; and 

matching the new query representation against the data- 
base. 

8. The method of claim 7 wherein the new query repre- 
sentation is a combination of the alternative representation 
of the query and the alternative representations of the 
documents, so selected. 

9. The method of claim 7 wherein the new query repre- 
sentation does not depend on the alternative representation 
of the query. 

10. A method of operating a computerized information 
retrieval system where information is retrieved from a 
database containing documents in response to user queries, 
the method comprising: 

receiving a natural language query specifying information 
to be retrieved; 

processing the query to abstract the query to an alternative 
representation suitable for input to a database of 
documents, each document of which is abstracted to a 
corresponding alternative representation;

displaying query information to the user indicating the 
result of said step of processing the query, the query 
information including a number of items characterizing 
the query including a subset of a finite set of subjects, 
the subset corresponding to subjects interpreted by the 
system to characterize the subject contents of the query;

entering a mode that permits user modification of the
items of query information;

remaining in the mode pending user input that specifies 
exiting the mode, user input that specifies exiting the 
mode including a request to execute the query; 

while in the mode, receiving user modifications, if any, of 
the items of query information; and 

in response to a request to execute the query, executing 
the query, modified in accordance with any user modi- 
fications of the items of query information, against the 
database. 

11. The method of claim 10 wherein user input specifying 
exiting the mode further includes: 

a request to modify the query; and 
a request to input a new query. 

12. The method of claim 10 wherein: 
the query includes a proper noun; 

the query information includes a plurality of variations 
representing an expansion of the proper noun. 

13. The method of claim 10 wherein: 
the query includes a particular term; and 

the query information includes a plurality of variations 
representing an expansion of the particular term. 

14. The method of claim 10 wherein:

the user modifications include increasing or decreasing 
the subset of subjects. 

15. The method of claim 10 wherein: 

the query information is a set of terms and phrases found 
in the query that are interpreted by the system to be 
mandatory requirements for relevance; and 

the user modifications include increasing or decreasing 
the set of terms and phrases. 

16. A method of operating a computerized information 
retrieval system where information is retrieved from a 
database containing documents in response to user queries, 
the method comprising: 

processing a natural language query specifying informa- 
tion to be retrieved to generate an alternative represen- 
tation of the query, the alternative representation 
including generating a logical representation incorpo- 
rating terms and phrases found in the query and assign- 
ing weighted Boolean scores to the query terms; 

matching the alternative representation of the query
against the database;

for at least some documents in the database, generating a 
measure of relevance of the document to the query 
using a common alternative representation of the docu- 
ment and the alternative representation of the query; 
and 

displaying a list of documents whose measures of rel- 
evance are determined by the system to be sufficiently 
high; 

receiving user input specifying selection of at least a 
portion of at least some of the list of documents; 

using the alternative representations of the documents, or 
portions of documents, so selected, to generate a new 
query representation; and 

matching the new query representation against the data- 
base. 

17. The method of claim 16 wherein the new query 
representation is a combination of the alternative represen- 
tation of the query and the alternative representations of the 
documents or portions of documents, so selected. 

18. The method of claim 16 wherein the new query
representation does not depend on the alternative
representation of the query.

19. The method of claim 16 wherein:

the alternative representation of the query includes a
conceptual level defined by a subset of a finite set of
subject codes, the subset corresponding to subject
codes interpreted by the system to characterize the
subject contents of the query; and

the user input specifies increasing or decreasing the subset 
of subject codes. 

20. The method of claim 16 wherein:

the user input specifies increasing or decreasing the set of 
terms and phrases. 

21. The method of claim 16 wherein: 

the alternative representation of the query is a set of terms 
and phrases found in the query that are interpreted by 
the system to be mandatory requirements for relevance; 
and 

the user input specifies increasing or decreasing the set of 
terms and phrases. 

22. The method of claim 16 wherein: 

the database contains documents that have been processed 
to generate corresponding alternative representations of 
the items in the database. 

23. In a computerized information retrieval system where 
documents in a database are retrieved in response to user 
queries, with each document being processed to produce a 
respective alternative representation and each query being 
processed to provide a corresponding alternative 
representation, the alternative representation of the query 
including generating a logical representation incorporating 
terms and phrases found in the query and assigning weighted 
Boolean scores to the query terms, a method comprising: 

presenting the user with a list of documents, each docu- 
ment having a respective alternative representation; 

receiving user input specifying selection of at least a 
portion of at least some of the list of documents; 

using the alternative representations of the documents, or 
portions of documents, so selected, to generate a query 
representation; and 

matching the query representation against the database. 

24. A method of operating a computerized information 
retrieval system where information is retrieved from a 
database containing documents in response to user queries, 
the method comprising: 

receiving a natural language query specifying information
to be retrieved;

extracting terms that appear in the query;

detecting words that indicate negation of interest in
certain information;

if a word that indicates negation is detected, determining
which of the terms belong to a negative portion of the
query and which of the terms belong to a positive
portion of the query;

generating an alternative representation of the query that
includes the terms in both the positive and negative
portions of the query;

matching the alternative representation of the query
against the database by determining a measure of
relevance for each document;

providing a set of documents that satisfy a retrieval
criterion; and

within the set of documents, so provided, segregating
those documents that satisfy only the positive portion
of the query from those documents that satisfy both the
positive and negative portions of the query.

25. The method of claim 24 wherein:

the alternative representation of the query includes a
plurality of individual representations;

said step of determining a measure of relevance for each
document includes, for each document, determining a
score for each of the individual representations, and
combining the scores; and

at least one of the individual representations does not
include the terms in the negative portion of the query.
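
The negation handling recited in claims 24 and 25 can be illustrated by a minimal sketch. The cue list, the stopword list, the rule that every term after a negation cue belongs to the negative portion, and the retrieval criterion (a document must contain at least one positive term) are all simplifying assumptions made for this example; the names split_query and segregate are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Assumed, illustrative word lists; a real system would use richer evidence.
NEGATION_CUES = {"not", "except", "excluding", "without"}
STOPWORDS = {"in", "of", "the", "about", "involving"}


def split_query(tokens: List[str]) -> Tuple[List[str], List[str]]:
    """Assign each extracted term to the positive or negative portion.

    Assumption: once a negation cue is seen, all later terms are negative.
    """
    positive: List[str] = []
    negative: List[str] = []
    target = positive
    for token in tokens:
        word = token.lower()
        if word in NEGATION_CUES:
            target = negative
        elif word not in STOPWORDS:
            target.append(word)
    return positive, negative


def segregate(
    docs: Dict[str, Set[str]], positive: List[str], negative: List[str]
) -> Tuple[List[str], List[str]]:
    """Split retrieved documents into positive-only and positive-and-negative."""
    positive_only: List[str] = []
    both: List[str] = []
    for doc_id, terms in docs.items():
        if not any(term in terms for term in positive):
            continue  # assumed retrieval criterion not met
        if any(term in terms for term in negative):
            both.append(doc_id)
        else:
            positive_only.append(doc_id)
    return positive_only, both


pos, neg = split_query("mergers in banking not involving insurance".split())
docs = {
    "d1": {"mergers", "banking"},
    "d2": {"mergers", "insurance"},
    "d3": {"weather"},
}
positive_only, both = segregate(docs, pos, neg)
```

Documents matching the negative portion are kept in a separate group rather than discarded, so the user can still inspect them.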

26. A method of operating a computerized information 
retrieval system where information is retrieved from a 
database containing documents in response to user queries, 
the method comprising: 

receiving a natural language query specifying information 
to be retrieved; 

extracting terms that appear in the query; 

detecting words that indicate that a given term is required
to be in a retrieved document;

if a word that indicates a mandatory term is detected,
determining which of the terms are indicated to be
mandatory;

generating an alternative representation of the query that 
includes both the terms that are indicated to be man- 
datory and the terms that are not indicated to be 
mandatory, the alternative representation including a 
logical representation wherein the mandatory terms are 
logically connected to the query terms using an AND 
fuzzy Boolean operator; 

matching the alternative representation of the query 
against the database by determining a measure of 
relevance for each document; 

providing a set of documents that satisfy a retrieval 
criterion; and 

within the set of documents, so provided, segregating 
those documents that satisfy the mandatory portion of 
the query from those documents that do not satisfy the 
mandatory portion of the query. 

27. The method of claim 26, and further comprising the 
steps, performed before said matching step, of: 

displaying query information to the user indicating the 
terms that are interpreted by the system to be manda- 
tory; 

receiving user input responsive to the display of such
query information; and

in response to such user input, if any, modifying the
alternative representation of the query to reflect such
user input.

28. The method of claim 26 wherein:

said matching step includes increasing a document's
measure of relevance if the document satisfies the
mandatory portion of the query.
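
The segregation and score increase recited in claims 26 and 28 can be sketched as follows. The sketch assumes that the mandatory terms have already been identified, that a base relevance score is available for each document, and that the increase is a fixed additive amount; the function name rank_with_mandatory and the boost value are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Ranked = List[Tuple[str, float]]


def rank_with_mandatory(
    docs: Dict[str, Set[str]],
    base_scores: Dict[str, float],
    mandatory: Set[str],
    boost: float = 0.25,  # assumed additive increase, not from the source
) -> Tuple[Ranked, Ranked]:
    """Return (satisfying, not satisfying) groups, each ranked by score."""
    satisfied: Ranked = []
    unsatisfied: Ranked = []
    for doc_id, terms in docs.items():
        score = base_scores[doc_id]
        if mandatory <= terms:  # every mandatory term is present
            satisfied.append((doc_id, score + boost))
        else:
            unsatisfied.append((doc_id, score))
    satisfied.sort(key=lambda pair: pair[1], reverse=True)
    unsatisfied.sort(key=lambda pair: pair[1], reverse=True)
    return satisfied, unsatisfied


docs = {"d1": {"merger", "bank"}, "d2": {"merger"}, "d3": {"bank", "loan"}}
base_scores = {"d1": 0.5, "d2": 0.75, "d3": 0.25}
meets, fails = rank_with_mandatory(docs, base_scores, {"bank"})
```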

29. A method of operating a computerized information 
retrieval system where information is retrieved from a 
database containing documents in response to user queries, 
the method comprising: 

receiving a natural language query specifying information 
to be retrieved; 

detecting linguistic clues, such as function words and
punctuation, signifying logical relations among terms
in the query;

generating a logical representation of the query,
incorporating the terms, linked on the basis of the linguistic
clues signifying logical relations among the terms;

assigning a respective weighted Boolean score to each
term in the logical representation, the weighted
Boolean scores based on the logical relations among the
query terms;

comparing the logical representation of the query with a 
document to be scored; 

for each term in the logical representation, (a) if that term 
is found in the document, assigning a possible weight 
corresponding to the weighted Boolean score to that
term, and (b) otherwise assigning a zero weight to that 
term; 

combining the weights, so assigned; and 

computing a score based on the weights, so combined. 

30. The method of claim 29 wherein:

the weighted Boolean scores have maximum values such
that the score is 1 if all the terms in the logical
representation are present in the document.

31. The method of claim 29 wherein: 

a weight for ANDed terms is determined as the sum of 
weighted scores for the ANDed terms; and 

a weight for ORed terms is determined as the maximum 
of the weighted scores for the ORed terms. 
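
The scoring recited in claims 29 through 31 can be illustrated by a minimal sketch. The claims state only that ANDed weights are summed, that ORed weights are combined by taking the maximum, and that the score is 1 when every term is present. The rule used here for distributing weights (an AND node splits its weight equally among its children, an OR node passes its full weight to each child) is an assumption chosen because it satisfies those three properties; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set, Union


@dataclass
class Term:
    text: str


@dataclass
class Node:
    op: str  # "AND" or "OR"
    children: List[Union["Node", Term]]


def score(expr: Union[Node, Term], doc_terms: Set[str], weight: float = 1.0) -> float:
    """Score a document against a logical query representation."""
    if isinstance(expr, Term):
        # A term found in the document earns its weight, otherwise zero.
        return weight if expr.text in doc_terms else 0.0
    if expr.op == "AND":
        share = weight / len(expr.children)  # assumed equal split
        return sum(score(child, doc_terms, share) for child in expr.children)
    if expr.op == "OR":
        return max(score(child, doc_terms, weight) for child in expr.children)
    raise ValueError(f"unknown operator: {expr.op}")


# merger AND (bank OR lender)
query = Node("AND", [Term("merger"), Node("OR", [Term("bank"), Term("lender")])])

full_match = score(query, {"merger", "bank", "lender"})
partial_match = score(query, {"lender"})
no_match = score(query, {"weather"})
```

Under these assumptions a document containing only "lender" scores 0.5: it satisfies the OR branch in full but contributes nothing for "merger".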

32. A method of operating a computerized information 
retrieval system where information is retrieved from a 
database containing documents in response to user queries, 
the method comprising: 

receiving a natural language query specifying information 
to be retrieved; 

processing the query to generate an alternative represen- 
tation of the query; 

processing the documents to generate respective alterna- 
tive representations of the documents, the alternative 
representation of a given document including a con- 
ceptual level defined by subject areas interpreted by the 
system to characterize the subject contents of the 
document; 

matching the alternative representation of the query
against the database by determining a measure of
relevance for each document;

providing a set of documents meeting a retrieval criterion;

determining a set of subject areas that are present in the
conceptual level representations of the documents, so
provided; and

for at least some subject areas in the set, providing groups 
of documents having that subject area as part of their 
conceptual level representations. 

33. The method of claim 32, and further comprising 
displaying a representation of folders having subject area 
labels. 
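
The grouping of retrieved documents into subject area folders, as recited in claims 32 and 33, can be sketched as follows. The sketch assumes that each document's conceptual level representation is available as a list of subject area labels and that the retrieval criterion is a fixed cutoff on the number of highest-scoring documents; the function name folder_by_subject and the sample labels are hypothetical.

```python
from typing import Dict, List


def folder_by_subject(
    scores: Dict[str, float],
    subject_areas: Dict[str, List[str]],
    top_n: int,
) -> Dict[str, List[str]]:
    """Group the top_n documents by the subject areas assigned to them.

    A document appears in one folder per subject area it carries, and
    each folder lists its documents in descending order of relevance.
    """
    retrieved = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)[:top_n]
    folders: Dict[str, List[str]] = {}
    for doc_id in retrieved:
        for area in subject_areas[doc_id]:
            folders.setdefault(area, []).append(doc_id)
    return folders


scores = {"d1": 0.9, "d2": 0.4, "d3": 0.7}
subject_areas = {
    "d1": ["Finance", "Law"],
    "d2": ["Sports"],
    "d3": ["Finance"],
}
folders = folder_by_subject(scores, subject_areas, top_n=2)
```

Only subject areas that actually occur among the retrieved documents produce folders, so "Sports" does not appear in this example.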

34. The method of claim 32 wherein:

the retrieval criterion requires a document to be in the
group having the highest measures of relevance.